Locating security vulnerabilities in source code

ABSTRACT

A tool ( 22 ) automatically analyzes application source code ( 16 ) for application level vulnerabilities. The tool integrates seamlessly into the software development process, so vulnerabilities are found early in the software development life cycle, when removing the defects is far cheaper than in the post-production phase. Operation of the tool is based on static analysis, but makes use of a variety of techniques, for example methods of dealing with obfuscated code.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Application claims the benefit of U.S. Provisional Application No. 60/853,349, filed Oct. 19, 2006, which is herein incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to vulnerability assessment of computer software. More particularly, this invention relates to scanning application source code automatically to detect application level vulnerabilities.

2. Description of the Related Art

Enterprise security solutions have historically focused on network and host security, e.g., using so-called “perimeter protection” techniques. Despite these efforts, application level vulnerabilities remain serious threats. Detection of such vulnerabilities has been attempted by lexical analysis of source code. This typically results in large numbers of false positive indications. Line-by-line code analysis has been proposed. However, this has proved to be impractical, as modern software suites typically have thousands of lines of code. Indeed, even in relatively compact environments, such as J2EE™ (Java™ 2 Standard Edition), a runtime module may include thousands of classes.

One technique for detection of vulnerabilities is exemplified by U.S. Patent Application Publication No. 2006/0253841, entitled “Software Analysis Framework”. This technique involves decompilation to parse executable code, identifying and recursively modeling data flows, identifying and recursively modeling control flow, and iteratively refining these models to provide a complete model at the nanocode level.

Static analysis of program code is disclosed in U.S. Patent Application Publication No. 2005/0015752, entitled “Static Analysis Based Error Reduction for Software Applications”. A set of analyses sifts through the program code and identifies programming security and/or privacy model coding errors. A further evaluation of the program is then performed using control and data flow analyses.

Another approach is proposed in U.S. Patent Application Publication No. 2004/0255277, entitled “Method and system for Detecting Race Condition Vulnerabilities in Source Code”. Source code is parsed into an intermediate representation. Models are derived for the code and then analyzed in conjunction with pre-specified rules about the routines to determine if the routines possess one or more of pre-selected vulnerabilities.

Some attempts have been made to examine source code. U.S. Patent Application Publication No. 2003/0056192, entitled “Source Code Analysis System and Method”, proposes building a database associated with a software application. A viewer provides access to the contents of the database. Relevant information may then be displayed, including module-to-module communication, calls made to databases or external files, and variable usage throughout the application. Presumably, the operator would be able to identify vulnerabilities from the display.

SUMMARY OF THE INVENTION

According to aspects of the invention, an automatic tool analyzes application source code for application level vulnerabilities. The tool integrates seamlessly into the software development process, so vulnerabilities are found early in the software development life cycle, when removing the defects is far cheaper than in the post-production phase. Operation of the tool is based on static analysis, but makes use of a variety of techniques, for example, methods for dealing with obfuscated code.

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, the source code including code elements and statements, at least a portion of the statements referencing variables, and the variables including data structures having member variables. The processor is operative to construct an object-oriented model of the source code by assigning respective identifiers to the member variables. The processor is operative, using the model, to construct a control flow graph including nodes, derive a data flow graph from the control flow graph, derive a control dependence graph from the control flow graph, analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability by identifying references to a predetermined member variable using the respective identifiers thereof, wherein the member name of the predetermined member variable is identical to the member name of another member variable, and to report the security vulnerability.

According to an aspect of the data processing system, the processor is operative to modify the source code to remove the security vulnerability.

According to another aspect of the data processing system, the processor is operative to code-slice the control dependence graph to define blocks of the control dependence graph that represent atomic elements of the source code, wherein no more than a single action is performed.

According to yet another aspect of the data processing system, the processor is operative to identify data flow nodes in the data flow graph wherein input data is validated, and verify that the input data is validated in the identified data flow nodes in accordance with a predetermined specification.

According to a further aspect of the data processing system, the processor is operative to apply software fault tree analysis to the source code.

According to an aspect of the data processing system, the processor is operative to generate test cases for a data validation function to identify scenarios wherein the data validation function fails.

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, the source code including code elements and statements, at least a portion of the statements referencing variables. The processor is operative to construct an object-oriented model of the source code, wherein the code elements are represented by respective objects, using the model, construct a control flow graph including nodes, derive a data flow graph from the control flow graph, derive a control dependence graph from the control flow graph, analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability, by traversing a first portion of the control dependence graph a first time, and marking a traversed segment of the control dependence graph, and thereafter traversing a second time a second portion of the control dependence graph that includes the marked segment by skipping the marked segment, and report the security vulnerability.

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, the source code including code elements and statements. The processor is operative to construct an object-oriented model of the source code, wherein the code elements are represented by respective objects, using the model, construct a control flow graph including nodes, derive a data flow graph from the control flow graph, derive a control dependence graph from the control flow graph, analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability by identifying in the data flow graph first data flow nodes wherein input is accepted, second data flow nodes wherein data is validated, and third data flow nodes wherein data is consumed, removing the second data flow nodes from the data flow graph, thereafter determining that one of the third data flow nodes is connected to one of the first data flow nodes by one of the data flow edges, and to report the one third data flow node as having an unvalidated input vulnerability.
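The unvalidated-input check of the preceding embodiment can be pictured with a minimal sketch. It is an illustration only, not the disclosed implementation: the edge-list representation, the function name find_unvalidated_sinks, and the node names are assumptions, and "connected" is interpreted here as reachability along data flow edges once the validation nodes have been removed.

```python
from collections import deque

def find_unvalidated_sinks(edges, inputs, validators, sinks):
    """Return consuming nodes still reachable from an input node after
    every validation node has been removed from the data flow graph."""
    # Build adjacency lists, dropping each edge that touches a validator.
    graph = {}
    for src, dst in edges:
        if src in validators or dst in validators:
            continue
        graph.setdefault(src, set()).add(dst)

    # Breadth-first search from all input nodes.
    reachable = set()
    queue = deque(inputs)
    while queue:
        node = queue.popleft()
        if node in reachable:
            continue
        reachable.add(node)
        queue.extend(graph.get(node, ()))

    return sorted(s for s in sinks if s in reachable)

# Hypothetical data flow graph: one path passes through a validator,
# the other reaches a consumer directly.
edges = [
    ("read_param", "sanitize"),
    ("sanitize", "run_query"),
    ("read_param", "write_log"),
]
result = find_unvalidated_sinks(
    edges,
    inputs={"read_param"},
    validators={"sanitize"},
    sinks={"run_query", "write_log"},
)
print(result)
```

In this sketch only the consumer that receives input without passing through the validation node is reported; the consumer behind the validator becomes disconnected once the validator is removed.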

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, wherein the code elements are represented by respective objects. The processor is operative, using the model, to construct a control flow graph including nodes, wherein the control flow graph describes a plurality of functions in the source code, the variables further comprise global variables, and the global variables are passed to the functions as a super-global variable having the global variables as data members thereof, derive a data flow graph from the control flow graph, derive a control dependence graph from the control flow graph, analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability, and report the security vulnerability.

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, the source code including code elements and statements, at least a portion of the statements referencing variables. The processor is operative to construct an object-oriented model of the source code, wherein the code elements are represented by respective objects, using the model, construct a control flow graph, derive a data flow graph from the control flow graph, the data flow graph including data flow nodes and data flow edges connecting the data flow nodes. The processor is operative to derive the data flow graph by associating a first array and a second array with each of the data flow nodes, wherein the first array holds static information regarding ones of the variables on which its respective associated data flow node depends, and the second array holds information that identifies other variables that influence the associated data flow node, the other variables being associated with others of the data flow nodes. The processor is operative to perform a traversal of the control flow graph, and at each of the nodes thereof establish the information in the second array of a corresponding data flow node in the data flow graph, and responsively to the information, to construct data flow edges to connect data flow nodes with the others of the data flow nodes, respectively, derive a control dependence graph from the control flow graph, analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability and report the security vulnerability.

An embodiment of the invention provides a data processing system for detecting security vulnerabilities in a computer program, including a memory having computer program instructions stored therein, an I/O facility, and a processor accessing the memory to read the instructions, wherein the instructions cause the processor to receive source code to be analyzed via the I/O facility, the source code including code elements and statements, at least a portion of the statements referencing variables, the variables including member variables, the member variables having member names, construct an object-oriented model of the source code, wherein the code elements are represented by respective objects. Using the model, the processor is operative to construct a control flow graph including nodes, each of the nodes having a topological order in the control flow graph, and a portion of the nodes having at least one child node, derive a data flow graph from the control flow graph, the data flow graph including data flow nodes and data flow edges connecting the data flow nodes, derive from the control flow graph a control dependence graph having control dependence nodes. The processor is operative to derive the control dependence graph by assigning each of the nodes of the control flow graph an innate property that is inherited by the at least one child node thereof in equal proportions as inherited properties therein, in each of the nodes canceling ones of the inherited properties that sum to the innate property thereof, maintaining respective inheritance records of the inherited properties of the nodes, the inheritance records including identifications of the nodes that are sources of origin of the respective inherited properties, identifying an entry node in the control flow graph, identifying a first set of the nodes whose members lack inherited properties, establishing respective first edges between members of the first set and the entry node, identifying a second set of the nodes, wherein members of the second set have inherited properties, identifying in members of the second set a respective closest topological order of the sources of origin in the inheritance records thereof, respectively, and constructing second edges between the members of the second set and the sources of origin having the closest topological order, respectively. The processor is operative to analyze the control flow graph, the data flow graph and the control dependence graph to identify a portion of the source code having a security vulnerability and report the security vulnerability.

Other embodiments of the invention provide methods and computer software products for carrying out the functions of the data processing system.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:

FIG. 1 is a block diagram of a system for processing computer program code, in accordance with a disclosed embodiment of the invention;

FIG. 2 is a detailed block diagram of a source code analysis engine shown in FIG. 1, in accordance with a disclosed embodiment of the invention;

FIG. 3 is a composite diagram illustrating aspects of source code analysis using the source code analysis engine shown in FIG. 1, in accordance with a disclosed embodiment of the invention;

FIG. 4 is a control flow graph that represents a single source code statement, in accordance with a disclosed embodiment of the invention;

FIG. 5 is a flow chart of a method of constructing an invocation-aware single method control flow graph, in accordance with a disclosed embodiment of the invention;

FIG. 6 is an exemplary invocation-aware single method control flow graph, in accordance with a disclosed embodiment of the invention;

FIG. 7 is a data flow graph that is constructed in accordance with a disclosed embodiment of the invention;

FIG. 8 is a flow chart of a method for constructing a data flow graph, in accordance with a disclosed embodiment of the invention;

FIG. 9 is a diagram illustrating construction of a data flow graph according to the method shown in FIG. 8, in accordance with a disclosed embodiment of the invention;

FIG. 10 is a diagram that illustrates a process of building a data flow graph in accordance with a disclosed embodiment of the invention;

FIG. 11 is a flow chart of a method of establishing potentials in the nodes of a control flow graph for use in constructing a control dependence graph, in accordance with a disclosed embodiment of the invention;

FIG. 12 is a flow chart illustrating further details of the method of FIG. 11, in accordance with a disclosed embodiment of the invention;

FIG. 13 is a composite diagram illustrating construction of a control dependence graph in accordance with the methods disclosed with reference to FIG. 11 and FIG. 12, in accordance with a disclosed embodiment of the invention;

FIG. 14 diagrammatically illustrates stub replacement in a control flow graph, in accordance with a disclosed embodiment of the invention;

FIG. 15 is a series of control dependence graphs illustrating closure computation, in accordance with a disclosed embodiment of the invention;

FIG. 16 is a flow chart of a method for identifying a possibility of unvalidated input in a computer program, in accordance with a disclosed embodiment of the invention;

FIG. 17 diagrammatically illustrates processing of data flow graphs to determine unvalidated input vulnerabilities in accordance with a disclosed embodiment of the invention; and

FIG. 18 diagrammatically illustrates processing of an exemplary proprietary data validation function in accordance with a disclosed embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent to one skilled in the art, however, that the present invention may be practiced without these specific details. In other instances, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the present invention unnecessarily.

Software programming code, which embodies aspects of the present invention, is typically maintained in permanent storage, such as a computer readable medium. In a client/server environment, such software programming code may be stored on a client or a server. The software programming code may be embodied on any of a variety of known media for use with a data processing system, such as a diskette, or hard drive, or CD-ROM. The code may be distributed on such media, or may be distributed to users from the memory or storage of one computer system over a network of some type to other computer systems for use by users of such other systems.

Definitions

The term “vulnerability” refers to a section of program source code, which when executed, has the potential to allow external inputs to cause improper or undesired behavior. Examples of vulnerabilities include buffer overflow, race conditions, and privilege escalation.

“Control flow” refers to a logical execution sequence of program instructions beginning, logically, at the beginning, traversing various loops and control transferring statements (branches), and concluding with the end or termination point of the program.

A “control flow graph” (CFG) is a graphical representation of paths that might be traversed through a program during its execution. Each node in the graph represents a basic block, i.e., a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow.

“Data flow” refers to the process within the program whereby variables and data elements, i.e., data that is stored in program memory either dynamically or statically on some external memory unit, are read from or written to memory. Data flow includes the process whereby variables or data inputs or outputs are defined by name and content and used and/or modified during program execution. Data flow may be graphically represented as a “data flow graph”.

System Overview

Turning now to the drawings, reference is initially made to FIG. 1, which is a block diagram of a system 10 for processing computer program code, in accordance with a disclosed embodiment of the invention. The system 10 typically comprises a general purpose or embedded computer 12, which is provided with conventional memory and I/O facilities, and programmed with suitable software for carrying out the functions described hereinbelow. Thus, although portions of the system 10 are shown as comprising a number of separate functional blocks, these blocks are not necessarily separate physical entities, but rather represent different computing tasks or data objects stored in a memory that is accessible to the processor. These tasks may be carried out in software running on a single processor, or on multiple processors.

The software may be provided to the processor or processors on tangible media, such as CD-ROM or non-volatile memory, or may be retrieved from storage over data networks. Alternatively or additionally, the system 10 may comprise a digital signal processor or hard-wired logic. The system 10 may include a display 14, enabling an operator to interact with the system, typically via a graphical user interface.

The system 10 receives application source code 16, which is intended to be transformed into executable code. Typically, the transformation is accomplished by compilation to generate object code, and linking of the object code with library code, as is known in the art. However, the principles of the invention are equally applicable to software development systems in which intermediate representations are employed, or development environments employing source code interpreters.

The system 10 includes a source code analyzer 18 (SCA). This is a module that automatically scans the source code 16 in order to detect application level vulnerabilities. The source code analyzer 18 comprises a plurality of distinct layers, which can be independently modified. Each of the layers is coupled only to adjacent layers, which provides a considerable degree of isolation. Modifications to one of the layers generally affect only the succeeding layer that receives input from the modified layer. One layer is a security-related layer 20, which holds a set of queries that detect various types of security vulnerabilities, which are discussed in further detail below. The other layers in aggregate form a SCA engine 22 that is harnessed by the security-related layer 20. The modular architecture provides a high degree of flexibility. For example, it is possible to exchange the security-related layer 20 with another module that interfaces with the SCA engine 22. The layers forming the SCA engine 22 perform a variety of functions relating to the application being analyzed, e.g., evaluation of reliability, performance, and compliance with specifications and standards.

Reference is now made to FIG. 2, which is a block diagram of the SCA engine 22 (FIG. 1), in accordance with a disclosed embodiment of the invention. The SCA engine 22 has the following layers and components, which are described in further detail hereinbelow:

layer 24 (including classification module 26 and decompiler 28);

layer 30 (parser 32);

layer 34 (AST module 36);

layer 38 (DOM builder 40, including shallow DOM solver 42 and deep DOM solver 44);

layer 46 (CFG module 48);

layer 50 (DFG module 52);

layer 54 (CDG module 56);

layer 58 (CDG+DFG module 60);

layer 62 (database 64);

layer 66 (SFTA engine 68); and

layer 70 (unit testing engine 72).

Source Code Analyzer—Classification

Source code 16 constitutes the principal input to the SCA engine 22. The source code 16 is passed to classification module 26. Current versions of the SCA engine 22 are capable of scanning source code written in Java, C#, .NET, and the server-side scripting languages JSP (Java Server Page) and ASPX (Active Server Page Framework). It should be noted, however, that the principles of the invention disclosed herein are not limited to these particular languages. Converters may be constructed by those skilled in the art that enable the SCA engine 22 to process other computer languages. The classification module 26 determines which language is applicable to the source code 16. If the source code 16 is malformed or is presented in a language to which the SCA engine 22 has not been adapted, then the classification module 26 reports an error.
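The behavior of the classification step can be pictured with a minimal sketch. The text does not say how the classification module 26 recognizes a language; classification by file extension, the SUPPORTED table, and the names classify and ClassificationError are all assumptions made purely for illustration.

```python
import os

# Hypothetical mapping from file extension to a supported language.
SUPPORTED = {".java": "Java", ".cs": "C#", ".jsp": "JSP", ".aspx": "ASPX"}

class ClassificationError(Exception):
    """Raised when the input is in a language the engine has not been adapted to."""

def classify(filename):
    """Return the language applicable to the file, or report an error."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return SUPPORTED[ext]
    except KeyError:
        raise ClassificationError(f"unsupported source: {filename}") from None

print(classify("Login.aspx"))
```

Supporting a further language in this sketch amounts to adding an entry to the table, which loosely mirrors the role of the converters mentioned above.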

The SCA engine 22 can scan programs developed in environments that transform source code into intermediate representations, using known decompilation techniques. For example, both the .NET™ framework and Java platform work by transforming source code into an intermediate language, rather than machine code. Recognition by the SCA engine 22 that the source code 16 is received in an intermediate representation invokes the decompiler 28, which transforms the intermediate code into a higher level representation that is capable of being analyzed in the SCA engine 22. The decompiler 28 and elements of the classification module 26 can be constructed using known techniques, as taught in U.S. Pat. Nos. 5,881,290, 5,586,330, 5,586,328, and 7,210,133, which are herein incorporated by reference. A suitable decompiler, “Reflector for .NET”, is available from Lutz Roeder via the Internet.

Some programming practices tend to frustrate conventional code analyzers, for example code obfuscation. However, the SCA engine 22 is oriented toward the evaluation of code logic, which is not destroyed by code obfuscation. Moreover, the SCA engine 22 is not troubled by the often obscure identifiers that are generated in reverse engineered program code.

Modern programming languages, such as .NET, support event driven programs. The call graph of such programs is often poorly defined, as the order of function calls is deferred until runtime. The SCA engine 22 makes no assumptions about the order of raising various events, including Web events that occur in ASPX and JSP code. Such events are converted to C# and Java code, respectively, with introduction of appropriate meta-loops and select-case switches. The classification module 26 and decompiler 28 form first layer 24.
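The effect of a meta-loop with a select-case switch can be pictured with a minimal sketch expressed as control flow edges. The node names "entry", "loop" and "exit", the handler names, and the function meta_loop_edges are hypothetical; the sketch only illustrates why such a construct makes every ordering of event handlers a feasible path, and is not the disclosed conversion.

```python
def meta_loop_edges(handlers):
    """Model event handlers as branches of a switch nested in a loop."""
    edges = [("entry", "loop")]
    for handler in handlers:
        edges.append(("loop", handler))   # select-case branch to the handler
        edges.append((handler, "loop"))   # back edge: any event may follow
    edges.append(("loop", "exit"))
    return edges

def reachable(edges, start, goal):
    """Depth-first reachability test over a list of directed edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for src, dst in edges if src == node)
    return False

# Hypothetical Web event handlers.
edges = meta_loop_edges(["Page_Load", "Button_Click"])
print(reachable(edges, "Button_Click", "Page_Load"))
print(reachable(edges, "Page_Load", "Button_Click"))
```

Because each handler returns to the loop head, no handler order is assumed, and later analyses see data flowing between handlers in either direction.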

Parsing

Continuing to refer to FIG. 2, in layer 30 the classified and optionally decompiled source code is processed in the parser 32, where it is decomposed into individual tokens. The tokens are then passed into layer 34, and arranged, according to the grammar of the particular language, into an abstract syntax tree (AST) in AST module 36. This step is conventional. For example, the tool ANTLR, available from antlr.org on the Internet, is suitable for use as the AST module 36.

DOM Builder

The DOM builder 40 in layer 38 produces a document object model (DOM), which represents each code element of the abstract syntax tree by a matching object. The DOM builder 40 comprises two principal modules. The shallow DOM solver 42 (AST2DOM) receives a language-dependent abstract syntax tree, and returns an almost language-independent document object model. The output of the shallow DOM solver 42 is a “shallow” representation, in which logical connections between distant objects have yet to be established. The deep DOM solver 44 creates these connections based on relevant specifications of the language, which produces a document object model that is fully language-independent. The implication is that subsequent layers that make use of the document object model need not be language-aware. Thus, in order to support a new language, at most only the layers 24, 30, 34, 38 may need to be changed. Layers succeeding the layer 38 are unaware of the source code language.

Operation of the deep DOM solver 44 can be appreciated by a consideration of Listing 1. Analysis of source code may require a determination whether the references “i” indicate the same variable, or two different variables sharing the same identifier. If the two lines are in the same block, then the two references to “i” refer to one variable. However, if the lines are in two different blocks, then the determination depends on the visibility of the variables, the locations where they were declared, the inheritance hierarchy of the containing class, etc. While the shallow DOM solver 42 recognizes variables, the connection between them is established in the deep DOM solver 44. Each variable referenced by the DOM builder 40 is assigned a data member named “InstanceID”. The shallow DOM solver 42 assigns each referenced variable a different value for the member InstanceID, even if the references are to the same variable. Thus in Listing 1, the two references to “i” receive different values of the member InstanceID, even if they both refer to the same variable. In the deep DOM solver 44, all references to the same variable are assigned the same value of the member InstanceID. Methods have a similar mechanism, but instead of the member InstanceID, method declarations and invocations are assigned a member known as “DefinitionID”, which serves the same purpose.
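The two-pass assignment of InstanceID values can be pictured with a minimal sketch. It is not the disclosed resolver: scopes are modeled as hypothetical tuples of enclosing block names, resolution simply walks outward to the nearest declaring scope, and the names shallow_pass and deep_pass are assumptions. Inheritance and visibility rules, which the text says also matter, are ignored here.

```python
from itertools import count

def shallow_pass(references):
    """Give every reference its own InstanceID, as a shallow solver would."""
    ids = count(1)
    return [{"name": n, "scope": s, "instance_id": next(ids)} for n, s in references]

def deep_pass(refs, declarations):
    """Unify the InstanceID of all references that resolve to one declaration.

    declarations is a set of (name, scope) pairs; a scope is a tuple such as
    ("C", "f") meaning method f of class C.
    """
    canonical = {}
    for ref in refs:
        scope = ref["scope"]
        # Walk outward from the innermost scope to the nearest declaration.
        # An undeclared name falls through to the empty (outermost) scope.
        while scope and (ref["name"], scope) not in declarations:
            scope = scope[:-1]
        key = (ref["name"], scope)
        ref["instance_id"] = canonical.setdefault(key, ref["instance_id"])
    return refs

declarations = {("i", ("C", "f")), ("i", ("C", "g"))}
refs = shallow_pass([
    ("i", ("C", "f")),           # reference in method f
    ("i", ("C", "f", "loop")),   # reference in a nested block of f
    ("i", ("C", "g")),           # reference in a different method
])
shallow_ids = [r["instance_id"] for r in refs]
deep_ids = [r["instance_id"] for r in deep_pass(refs, declarations)]
print(shallow_ids, deep_ids)
```

After the deep pass, the two references inside method f share one identifier, while the like-named variable in method g keeps a distinct one.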

Example 1

Reference is now made to FIG. 3, which is a composite diagram illustrating aspects of source code analysis using the SCA engine 22 (FIG. 2), in accordance with a disclosed embodiment of the invention. Exemplary source code 76 has been processed to form an abstract syntax tree 78. A node 80 “NAMESPACENODE” in the second row of the abstract syntax tree 78 has a child node 82 “QUALIDENT” (QUALified IDENTifier). The node 82 has the value of “CheckmarxNamespace”, corresponding to an identifier 84 in the first line of the source code 76. This is the name of the namespace, but as far as the abstract syntax tree 78 is concerned, it is merely a string of characters representing an identifier. Knowledge that the identifier 84 identifies a namespace is exploited by the shallow DOM solver 42 of the DOM builder 40 (FIG. 2). This knowledge is captured in a document object model, a portion of which is shown as a DOM fragment 86. In this example, a DOM object 88 “NamespaceDecl” is created, and its data member “Name” is initialized with the value “CheckmarxNamespace”. Some properties are shown below the object 88.

Referring again to FIG. 2, the following issues are dealt with by the deep DOM solver 44. These are conventional, and are therefore not further discussed herein in the interest of brevity:

Type resolution;

Inheritance resolution;

Variables resolution;

Methods overloading resolution;

Methods overriding resolution;

Overloading resolution;

Polymorphism resolution;

Data member resolution (public/private/protected);

Constructors resolution;

“This” resolution;

Chaining; and

Base calling.

In addition, a procedure known as “member variable instances” is performed. Member variables are fields within a data structure, e.g., a C++ class. Normally, member variables are used in different functions or methods, but defined outside the functions or methods. In the DOM builder 40, each member variable receives a unique ID number. Thus, different member variables having like member names and having parents of like data types (e.g., member name “.x” in member variables a.x, b.x) are distinguished from one another, and considered as different variables. Some conventional integrated development environments, e.g., Visual Studio®, have a feature that finds all references to a designated variable. Referring now to Listing 2, invoking the “member variable instances” procedure for a variable “j” (line 13) results in a finding of two references: a declaration for the variable j in line 9 and an assignment in line 13. Now consider references to a variable “a.x” in line 14. One expects to find two results: a declaration (line 3) and an assignment (line 14). However, Visual Studio would return an additional result, an assignment of a member variable b.x (line 15). This unwanted result is due to the fact that Visual Studio does not distinguish member variables having commonly named fields, such as a.x and b.x, from one another. The DOM builder 40, however, can tell them apart. The deep DOM solver 44 would assign different values to the member InstanceID for the member variables a.x and b.x. The member DefinitionID together with the member InstanceID allow the member variables to be differentiated.
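The difference between a name-based search and an identifier-based search for member variables can be pictured with a minimal sketch. Listing 2 is not reproduced here, so the reference list, its line numbers, and the names MemberIdTable, find_by_id and find_by_name are hypothetical; keying the identifier on the pair (owning instance, member name) is likewise an assumption about one way such uniqueness could be obtained.

```python
from itertools import count

class MemberIdTable:
    """Assign one ID per (owning instance, member name) pair."""
    def __init__(self):
        self._next = count(1)
        self._ids = {}

    def id_for(self, owner, member):
        key = (owner, member)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]

def find_by_name(refs, member):
    """Name-based search: cannot tell a.x from b.x."""
    return [line for line, _, m in refs if m == member]

def find_by_id(refs, owner, member, table):
    """ID-based search: returns only references to the designated member."""
    target = table.id_for(owner, member)
    return [line for line, o, m in refs if table.id_for(o, m) == target]

# Hypothetical references as (line number, owning instance, member name).
refs = [(14, "a", "x"), (15, "b", "x"), (16, "a", "x")]
table = MemberIdTable()
by_name = find_by_name(refs, "x")
by_id = find_by_id(refs, "a", "x", table)
print(by_name, by_id)
```

The name-based search returns the assignment to b.x as an unwanted result, whereas the identifier-based search returns only the references to a.x.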

Control Flow Graphs

Control flow graphs are constructed in layer 46 (FIG. 2), using CFG module 48. Each node of a control flow graph produced by the SCA engine 22 represents a single statement. Furthermore, in most cases, when a single statement contains several expressions, each expression is represented by its own node. Reference is now made to FIG. 4, which is a control flow graph 90 that represents a single source code statement, a “for” statement, in accordance with a disclosed embodiment of the invention. The control flow graph 90 illustrates how the components of the statement are elaborated into a plurality of nodes.

The CFG module 48 initially computes modular control flow graphs for single methods. Invocations of other methods are left intact, and complex expressions are divided into several atomic expressions, while preserving the logic of the expressions. The efficiency of this stage is O(n), where n is the number of sub-expressions in the code. For example, in the control flow graph 90, the invocation Write( ) is not further analyzed.

The next phase in the operation of the CFG module 48 is transformation of the control flow graph 90 into an invocation-aware single method control flow graph. In order to be able to integrate a plurality of methods, a placeholder for the invoked method has to be prepared, and a stub is created for this purpose. If source code is available, it eventually replaces the stub. Furthermore, in practice a calling method makes preparations before the call, and performs “cleanup” after return. Similarly, the called method must make some preparation at the beginning and cleanup at the end of the call.

Reference is now made to FIG. 5, which is a flow chart of a method of constructing an invocation-aware single method control flow graph, in accordance with a disclosed embodiment of the invention. At initial step 92, a single method control flow graph is prepared, as described with reference to FIG. 4.

Next, at step 94, the calling method, which is the method described by the control flow graph prepared in step 92, initializes the parameters to be sent to a called method. The parameters include the “this” object, which may be manipulated during the call, and include global variables as well. It is desirable to create a container for the global variables to facilitate their transport.

Next, at step 96, the called method copies its parameters into temporary variables. These are placed in a section termed a “prolog”.

Control now proceeds to step 98. A temporary stub is created as a placeholder in the control flow graph. In case the source code of the called method is not available, it is necessary to make assumptions about the use of the sent parameters by the called method. There are two main types of stubs that can be constructed in step 98. In the first type, it is assumed that the called method uses its parameters, but that it does not update them. In the second type, it is assumed that the parameters influence the “this” object. The decision about the use of each stub is based on several heuristics, such as the method's name, the names and number of its parameters, and the use of the return value. The second type is used whenever the function name it replaces begins with “set*”, “add*” or “append*”, wherein the character “*” is a wild card operator. The first type is used otherwise.
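The name-based portion of the heuristic of step 98 may be sketched in Python as follows, for illustration only. The function name choose_stub and the labels "read_only" and "updates_this" are hypothetical. Case-insensitive matching is an assumption, and the other heuristics mentioned above (parameter names and number, use of the return value) are omitted.

```python
# Prefixes that, per the description above, suggest the call modifies "this".
UPDATING_PREFIXES = ("set", "add", "append")


def choose_stub(method_name):
    """Pick a stub type for a called method whose source code is unavailable."""
    # Assumption: the prefix test ignores letter case.
    if method_name.lower().startswith(UPDATING_PREFIXES):
        return "updates_this"  # second stub type: parameters influence "this"
    return "read_only"         # first stub type: parameters are used, not updated


for name in ("setName", "AppendLine", "getName"):
    print(name, "->", choose_stub(name))
```

Because the test is purely lexical, a name such as “address” would also select the second stub type; this is a property of the stated heuristic rather than of the sketch.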

Step 100 is performed following step 98. The called method copies the parameters that it modifies (“out parameters”), which are typically “by reference” parameters, into temporary variables, termed an “epilog”.

Next, at final step 102, the calling method updates the “out” parameters with new data in the temporary variables. The result is a control flow graph of a single method, embellished with prologs and epilogs, and provided with a stub for each invoked method.

Example 2

Reference is now made to FIG. 6, which is an exemplary invocation-aware single method control flow graph 104, in accordance with a disclosed embodiment of the invention. The control flow graph 104 has nodes 106, 108 representing a prolog and an epilog, respectively. A stub 110 for the called function, func2( ), is included, because in this example source code is unavailable for the called function func2( ).

Data Flow Graphs

Referring again to FIG. 2, data flow graphs are constructed in layer 50, using DFG module 52. Reference is now made to FIG. 7, which is a data flow graph 112 that is constructed in accordance with a disclosed embodiment of the invention. A data flow graph describes data flow within a single method. Each time a value is assigned to a variable, the influencing object and the location of the assignment are recorded. As a result, a graph can be constructed, in which the data flow nodes are variables in a specific location, and connections represent dependency. The graph is transitive. Thus, a backward transitive closure computed at a specific location retrieves all influencing variables. It will be noted that each node is directed to a node that affects or influences it. For example, in node 114, the assigned value of the variable “c” is influenced by an assignment to a variable “a” in node 116.
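A backward transitive closure of the kind mentioned above may be sketched in Python as follows. This is a minimal illustration, assuming the data flow graph is held as a mapping from each node to the set of nodes that directly influence it; the function name influencing_nodes and the node labels are hypothetical and do not reproduce FIG. 7.

```python
def influencing_nodes(influenced_by, start):
    """Collect every node that directly or transitively influences `start`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for source in influenced_by.get(node, ()):
            if source not in seen:  # visit each influencing node once
                seen.add(source)
                stack.append(source)
    return seen


# Hypothetical graph: each edge points from a node to a node that influences it.
graph = {
    "c@3": {"b@2"},
    "b@2": {"a@1"},
    "a@1": set(),
    "d@4": set(),
}
print(sorted(influencing_nodes(graph, "c@3")))  # ['a@1', 'b@2']
```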

Reference is now made to FIG. 8, which is a flow chart of a method for constructing a data flow graph, in accordance with a disclosed embodiment of the invention. The process steps are shown in a particular linear sequence in FIG. 8 for clarity of presentation. However, it will be evident that many of them can be performed in parallel, asynchronously, or in different orders.

At initial step 118, a method is selected, and a control flow graph for the method is constructed as described above with reference to FIG. 5. The data flow graph is based on the control flow graph. A unique identifier is assigned to each node of at least the duplicate control flow graph.

Next, at step 120 the nodes of the control flow graph prepared in initial step 118 are duplicated. The duplicated nodes are to be used for constructing the data flow graph by establishing appropriate edges. The original control flow graph is normally used for other purposes by the SCA engine 22 (FIG. 2). Of course, in applications in which the original control flow graph is not required for other purposes, duplication may be omitted, and the original version used. Duplication of the nodes is typically implemented as a computer processing task in which the nodes are data objects, which are only optionally displayed to assist an operator.

Next, at step 122, a node is selected from the duplicated nodes. At this step, any nodes that do not relate to data are ignored, and may be discarded.

Next, at step 124 two arrays are created and associated with the currently selected duplicated node. These arrays are termed “VariablesThisBlockDependsOn” and “VariablesChangingLocations”, which respectively contain static information regarding variables on which the current node depends, and dynamic information regarding variables that are currently known to influence the current duplicated node. As will be seen from the description below, the dynamic information is developed during a traversal of the data flow graph.

Next, at step 126, the arrays are initialized. The array VariablesThisBlockDependsOn is initialized with information that is stored in the current duplicated node. Once this array is filled, it never changes. For example, a duplicated node corresponding to a statement a=b causes the one element of the array VariablesThisBlockDependsOn for the duplicated node to be initialized with the value “b”. Step 126 is sometimes descriptively termed “BuildSelfStatus”. The array VariablesChangingLocations is initialized with data relating to the current node. Linkage to nodes containing data that influence the current node occurs at a later stage. In the example given, the statement a=b results in the one element of array VariablesChangingLocations being initialized with a key “a” and a value of “1”.

Control now proceeds to decision step 128, where it is determined if more duplicated nodes in the control flow graph need to be processed. If the determination at decision step 128 is affirmative, then control returns to step 122 for selection of the next duplicated node.

If the determination at decision step 128 is negative, then a node-by-node traversal begins. The traversal order corresponds approximately to a breadth-first traversal of the original control flow graph. In a strict sense, a breadth-first search applies to a hierarchical tree structure. As the control flow graph may not be a hierarchical tree, the search initially solves parent nodes first and then proceeds from the parents in a breadth-first manner. Control proceeds to step 130. A duplicated node is selected.

Next, at step 132 an attempt is made to update the array VariablesChangingLocations for the current duplicated node by including all relevant variable information that could influence the current duplicated node. This is done by passing the array “by reference”, rather than “by value”, to the updating function. Passing the array by reference rather than as a copy spares computational resources. The efficiency of this step is O(1). In some cases, there may not presently be sufficient information to do this, and the duplicated node may need to be revisited after first having completed step 132 recursively for the node's descendants. Nodes requiring revisits are marked. The marked nodes are then revisited in a depth-first manner.

Control now proceeds to decision step 134, where it is determined if the array for the current duplicated node was successfully updated in step 132. If the determination at decision step 134 is negative, then control proceeds to decision step 136, which is described below.

If the determination at decision step 134 is affirmative, then control proceeds to step 138. The node is classified as having been evaluated. Relevant edges will be established between the current node and other duplicated nodes upon which it depends after all nodes have been evaluated.

After performance of step 138, or if the determination at decision step 134 is negative, control proceeds to decision step 136, where it is determined if more duplicated nodes need to be visited or revisited. If the determination at decision step 136 is affirmative, then control returns to step 130.

If the determination at decision step 136 is negative, then control proceeds to final step 140. Relevant edges are now constructed between the nodes, as noted above in the discussion of step 138. This is done by first consulting the array VariablesThisBlockDependsOn, and then adding edges based on the array VariablesChangingLocations. The data flow graph is then complete, and the procedure ends.

Example 3

Reference is now made to FIG. 9, which is a diagram illustrating construction of a data flow graph according to the method described with reference to FIG. 8, in accordance with a disclosed embodiment of the invention. As shown in the upper portion of FIG. 9, a data flow graph is constructed from a fragment of source code 142.

A control flow graph 144 and a duplicate control flow graph 146 are prepared, and the nodes of the latter are assigned unique identifiers (1, 2, 3, 4). In the lower portion of FIG. 9, the duplicate control flow graph 146 has been elaborated to illustrate that each of its nodes is associated with respective tables of variables: a column 148 of tables containing respective arrays “VariablesThisBlockDependsOn”, and a set of tables 150, each being offset according to the rank of its associated node in the duplicate control flow graph 146. For example, node 152 has been assigned unique identifier “1”, and has been associated with tables 154, 156. It will be recalled that each node represents a single source code statement. The purpose of table 154 is to identify those variables upon which the statement of source code 142, represented by the node 152, depends.

Similarly, node 158 is associated with tables 160, 162, node 159 with tables 164, 166, and node 168 with tables 170, 172.

Beginning at the top of the duplicate control flow graph 146 and progressing downward, the node where each variable was last changed is determined, and the actual data flow graph is constructed.

In the set of tables 150 each relevant variable is associated with a pointer to the node to which it relates. For example, node 158, corresponding to node 174 of control flow graph 144, represents the statement “B=A”. Node 158 has an identifier “2”. This identifier is found in table 162, together with the variable B, which is modified in the node 158.

Variables A and B are both relevant to node 158. The node 158 depends only on the variable A, as indicated by table 160. In table 162, variable A has been entered in the upper row and encoded “1”, corresponding to node 152, where it was last modified. In the lower row, variable B has been encoded “2”, as it was last modified in node 158.

Variable C is not mentioned in the source code statement “B=A”, and is considered to be irrelevant to node 158. Variable C has no entry in the tables 160, 162.

Reference is now made to FIG. 10, which is a diagram that illustrates a process of building a data flow graph in accordance with a disclosed embodiment of the invention. The process is disclosed with continued reference to FIG. 9, progressing downward in the duplicate control flow graph 146.

Variable A is changed in node 152, as indicated in partial data flow graph 176.

Variable B is changed in node 158. Node 158 depends on variable A, which was changed in node 152. This is illustrated by construction of a partial data flow graph 178 and the entries of tables 160, 162.

Variable A is again changed in node 159. Node 159 also depends on the variable A, last accessed in node 158, as indicated by table 164. A new partial data flow graph 182 is constructed to reflect this situation.

Variable C is changed in node 168. Node 168 depends on variable B that was previously changed in node 158, as shown in table 172. A complete data flow graph 184 can now be constructed.
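The bookkeeping described in this example may be sketched in Python for the simplest case of straight-line code, for illustration only. The function name build_data_flow_edges and the four-statement input are hypothetical; the input is loosely modeled on this example and is not a reproduction of source code 142. Branches, loops, and the revisit mechanism of FIG. 8 are not modeled.

```python
def build_data_flow_edges(statements):
    """statements: list of (assigned variable, variables read), in control flow order.

    Returns edges (dependent node id, influencing node id); node ids start at 1.
    """
    last_changed = {}  # plays the role of VariablesChangingLocations
    edges = []
    for node_id, (target, sources) in enumerate(statements, start=1):
        depends_on = tuple(sources)  # plays the role of VariablesThisBlockDependsOn
        for variable in depends_on:
            if variable in last_changed:
                # Edge directed to the node where the variable was last changed.
                edges.append((node_id, last_changed[variable]))
        last_changed[target] = node_id
    return edges


# Assumed fragment: A is set, B = A, A is updated from A, C = B.
fragment = [("A", []), ("B", ["A"]), ("A", ["A"]), ("C", ["B"])]
print(build_data_flow_edges(fragment))  # [(2, 1), (3, 1), (4, 2)]
```

Each edge is directed from a node to the node that influences it, in keeping with the convention stated for FIG. 7.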

In constructing the final data flow graph, each node is evaluated once by default. Furthermore, by virtue of the fact that the set of tables 150 is built incrementally, it is necessary to evaluate each node only once for each variable on which it depends in each nesting level of the source code in which the variable appears. For example, if a statement is nested inside a “for” statement, which in turn occurs inside an “if” statement, then the node corresponding to the statement will be solved at most three times. The efficiency is O(n*m), where n is the number of nodes and m is the deepest source code nesting level.

Metaconstructors

Traditionally, object oriented languages, e.g., C++, did not allow data members to be initialized before the class constructor executed. Newer versions, e.g., .NET, do allow early initialization of data members, i.e., at declaration time. Consequently, when employing the older languages, in order to construct a data flow graph, it is helpful to create a metaconstructor that performs all relevant assignment operations and initializations. In order to guarantee early initialization, the class constructor is written to invoke the metaconstructor before performing any of its routine functions.

In Listing 3, exemplary source code is presented, which illustrates the point. Variables “i” and “j” are initialized at the time of their declaration and not in the constructors. Adding a metaconstructor avoids any issues of uninitialized variables, and enables the data flow graph to be constructed more accurately. Addition of a metaconstructor to the code of Listing 3 is shown in Listing 4.

Control Dependence Graphs

Control dependence graphs are directed graphs that are known in the software engineering art, and are exploited by the SCA engine 22 (FIG. 2). Much like a data flow graph, a control dependence graph (CDG) shows the dependency of one statement on another, but the nature of the dependency is control rather than data. Each statement A is linked to a previous statement B, which controls whether statement A will be executed. In a control dependence graph, nodes or vertices represent executable statements, and edges represent direct control dependencies. Construction of control dependence graphs, however, is a known computational bottleneck, which is mitigated by aspects of the present invention.

Referring again to FIG. 2, control dependence graphs are constructed in layer 54, using CDG module 56, in which control dependence graphs are derived from control flow graphs produced by the CFG module 48 described above.

Construction of a control dependence graph is derived from a consideration of the shape, i.e., topology, of the corresponding control flow graph, rather than its content. It is assumed that each node of the control flow graph corresponds to one line of the source code. However, it is the structure of the control flow graph that is now of primary interest.

Each node is given an attribute, referred to herein as a “potential”, which has a numerical value, and which is propagated to its descendants. Potential is a quantity that reflects the control influence of one node upon another. By tracking the propagation of potential through a control flow graph, it is possible to extract control dependence information and thereby construct an accurate control dependence graph. Only the general topology of the graph and the topological orders of individual nodes are significant in this process.

As a result of propagation of potential, when a record of the course of the propagation and the source of origin is maintained, it becomes evident that a node can possess many combinations of innate and inherited potentials, each component of which is treated separately. Several rules for the inter-nodal propagation of potential are applicable:

Rule 1. Each node is initially assigned a potential having a value 1.0. This is referred to as “innate” potential.

Rule 2. A node propagates all its potentials to its child nodes. The value of the potentials is divided equally among its immediate child nodes. Potential propagated from a parent node to a child node is referred to as “inherited” potential in the child node. For example, in control flow graph 186 (FIG. 13), node 1 has two child nodes, node 2 and node 15. Each receives a potential contribution of 0.5 from node 1. Node 2 possesses its innate potential of 1.0 and an inherited potential of 0.5.

Rule 3. Propagation of a node's innate potential and propagation of its inherited potentials to a child node are treated as separate transactions.

Rule 4. Propagated potentials are labeled with their sources of origin. When a node has inherited multiple potentials from different origins, propagations of the multiple inherited potentials to nodal descendants are each treated individually, and accounted for separately. In the above example involving node 2, the inherited potential of 0.5 is tagged as originating from node 1.

Rule 5. When a node inherits a potential of exactly 1.0, the inherited potential is nullified. This can occur, for example, when a node has only one child. In the control flow graph 186, node 2 has only one child node, node 4. Node 4 has an innate potential of 1.0 in accordance with Rule 1. In a first transaction, in accordance with Rule 3, the innate potential of node 2 is propagated to node 4. Node 4 has thus inherited a potential having a value of 1.0. It is nullified. In a second transaction, node 4 receives the inherited potential (value 0.5) of node 2. The net effect is that node 4 has innate potential of value 1.0, and inherited potential of value 0.5, the latter tagged as originating from node 1. The terms “first transaction” and “second transaction” are used arbitrarily herein to distinguish the two transactions. These terms have no physical meanings with respect to the actual temporal order of the transactions.

Rule 6. Inherited potentials are additive for purposes of Rule 5. For example, a node may inherit potentials of 0.5 from each of two parents. The sum is 1.0. The two inherited potentials are therefore nullified. This actually occurs in node 13 of the control flow graph 186, and is described below.

In evaluating the potentials of a control flow graph, the graph is traversed. However, any node that cannot be immediately solved is ignored and visited later. Once a computation for the node is undertaken, that node is not revisited. To the extent possible, recognizing that nodes may have multiple parents, the traversal is conducted in a depth-first manner.

Reference is now made to FIG. 11, which is a flow chart of a method of establishing potentials in the nodes of a control flow graph as a first phase of constructing a control dependence graph, in accordance with a disclosed embodiment of the invention.

The process steps that follow are shown in an exemplary order, but can often be performed in many different orders according to the implementation that may be chosen by those skilled in the art.

At initial step 188 source code is selected and a control flow graph prepared as described above. At step 190, a node is selected.

Next, at step 192 the current node is initialized. An innate potential is assigned to the current node. In the current embodiment, this has a value of 1.0. However, other values may be chosen, so long as it is possible to determine whether inherited potentials sum to the value of the innate potential.

Control now proceeds to decision step 194, where it is determined if more nodes remain to be initialized. If the determination at decision step 194 is affirmative, then control returns to step 190.

If the determination at decision step 194 is negative, then initialization of the nodes has been completed, and evaluation of their potentials begins. Control proceeds to step 196. An unevaluated node of the control flow graph is selected.

Control now proceeds to decision step 198, where it is determined if all parents of the current node have been evaluated. In the case of the root node, which has no parents, this determination is affirmative.

If the determination at decision step 198 is affirmative, then control proceeds to step 200, which is described below.

If the determination at decision step 198 is negative, then evaluation of the current node is deferred until all the parents have been evaluated. Control proceeds to step 202. The current node is marked for revisit. Then, at step 204 an unevaluated parent of the current node is selected, and control returns to decision step 198.

Step 200 is performed if the determination at decision step 198 is affirmative.

The magnitude and the node of origin of all inherited potentials are recorded. It is desirable to record the topological order of the nodes of origin, as this information may be required later.

Control now proceeds to decision step 206, where it is determined if any combination of the inherited potentials of the current node sums to the value of the innate potential.

If the determination at decision step 206 is affirmative, then control proceeds to step 208. The particular set of inherited potentials is deleted from the record that was prepared in step 200. Control returns to decision step 206 to repeat the test using the remaining inherited potentials.

If the determination at decision step 206 is negative, then control proceeds to step 210. All the potentials of the current node are divided for propagation to the child nodes in subsequent iterations in accordance with Rule 2.

Next, at step 212 the current node is marked as having been evaluated, so that it will not be revisited.

Control now proceeds to decision step 214, where it is determined if unevaluated nodes remain. If the determination at decision step 214 is affirmative, then control returns to step 196 for selection of a new node.

If the determination at decision step 214 is negative, then control proceeds to final step 216. Here the nodal potentials of the control flow graph are employed to construct a control dependence graph. The details are given below.
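The propagation of potentials under Rules 1 through 6 may be sketched in Python as follows, for illustration only and not as the disclosed implementation. The sketch assumes an acyclic control flow graph supplied as a mapping from each node to its child nodes, visits nodes in a topological order in place of the deferred depth-first traversal of FIG. 11, and cancels inherited potentials per node of origin, which is a simplification of the “any combination” test of decision step 206. The names topological_order and inherited_potentials are hypothetical, and the edges of the sample graph are inferred from the discussion of control flow graph 186 in this description.

```python
from collections import deque
from fractions import Fraction


def topological_order(children):
    """Kahn's algorithm; `children` maps every node to a list of its child nodes."""
    indegree = {node: 0 for node in children}
    for succs in children.values():
        for s in succs:
            indegree[s] += 1
    ready = deque(node for node, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for s in children[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order


def inherited_potentials(children):
    """Return ({node: {origin: inherited potential}}, visiting order)."""
    order = topological_order(children)
    inherited = {node: {} for node in children}
    for node in order:
        # Rules 5 and 6: an inherited total of exactly 1 from one origin is nullified.
        inherited[node] = {o: v for o, v in inherited[node].items() if v != 1}
        succs = children[node]
        if not succs:
            continue
        outgoing = dict(inherited[node])  # Rule 4: potentials keep their origin tag
        outgoing[node] = Fraction(1)      # Rule 1: innate potential of value 1
        for s in succs:                   # Rule 2: divide equally among children
            for origin, value in outgoing.items():
                inherited[s][origin] = inherited[s].get(origin, 0) + value / len(succs)
    return inherited, order


# Edges inferred from the text: 1->2, 1->15, 2->4, 4->5, 5->7, 5->11, 7->13, 11->13, 13->15.
cfg = {1: [2, 15], 2: [4], 4: [5], 5: [7, 11], 7: [13], 11: [13], 13: [15], 15: []}
potentials, order = inherited_potentials(cfg)
for node in order:
    print(node, {origin: float(v) for origin, v in potentials[node].items()})
```

Exact fractions are used so that the comparison with the innate value 1 is not affected by floating point rounding.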

Once the potentials of the nodes in the control flow graph have been established, a control dependence graph 218 can be constructed. The following rules apply to construction of control dependence graphs:

Rule 7. If a node has only innate potential, value 1.0, it depends on the entry node (e.g., node “Enter” in control dependence graph 218).

Rule 8. If a node has multiple inherited potentials from different source nodes, then it depends on the source node of the corresponding control flow graph that is closest in topological order to the current node.

Rule 9. If a node has a single inherited potential, then it depends from the source node of the inherited potential. Rule 9 is actually a trivial case of Rule 8.

Reference is now made to FIG. 12, which is a flow chart illustrating further details of the method of FIG. 11 for constructing a control dependence graph in accordance with a disclosed embodiment of the invention. It is assumed that the method described with reference to FIG. 11 has been performed. The steps described below are an expansion of final step 216 (FIG. 11).

At initial step 220 an entry node is established for the control dependence graph.

Next, at step 222 a node of the control flow graph is selected. A depth-first traversal in which a node is visited only after its parent nodes have been solved is suitable for traversing the control flow graph.

Control now proceeds to decision step 224, where it is determined if the current node has inherited potentials.

If the determination at decision step 224 is negative, then it is concluded that the current node only has innate potential and Rule 7 applies. The current node depends directly on the entry node. Control proceeds to step 226. An edge is established between the entry node and the current node. Control then proceeds to decision step 228, which is described below.

If the determination at decision step 224 is affirmative, then control proceeds to step 230. It will be recalled that in step 200 (FIG. 11), the source node of origin of each inherited potential was recorded. In step 230, the topological orders of the sources are compared and the source node or nodes having the topological order closest to that of the current node are selected.

If there is only one inherited potential, then Rule 9 applies. The source node from which the single inherited potential derives is selected. Otherwise, Rule 8 applies. If a plurality of source nodes share the closest topological order, then all such source nodes are selected.

Next, at step 232 edges are established between the source node or nodes that were selected in step 230 and the current node.

Control now proceeds to decision step 228, where it is determined if more nodes in the control flow graph need to be visited. If the determination at decision step 228 is affirmative, then control returns to step 222 for selection of a new node.

If the determination at decision step 228 is negative, then control proceeds to final step 234. The control dependence graph is now complete and the procedure ends.
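Application of Rules 7 through 9 may be sketched in Python as follows, again for illustration only. The sketch takes the surviving inherited potentials of each node as given, assumes that every origin precedes the dependent node in a single linear topological order (so that exactly one closest source exists, whereas the method of FIG. 12 allows several), and uses the hypothetical function name build_control_dependence. The sample values are the uncanceled inherited potentials reported for control flow graph 186 in Table 1.

```python
def build_control_dependence(inherited, order, entry="Enter"):
    """Map each node to the node on which it is control dependent.

    inherited: {node: {origin node: surviving inherited potential}}
    order:     nodes of the control flow graph in topological order
    """
    rank = {node: position for position, node in enumerate(order)}
    depends_on = {}
    for node in order:
        origins = inherited[node]
        if not origins:
            depends_on[node] = entry  # Rule 7: innate potential only
        else:
            # Rules 8 and 9: choose the origin closest in topological order.
            depends_on[node] = max(origins, key=rank.__getitem__)
    return depends_on


order = [1, 2, 4, 5, 7, 11, 13, 15]
inherited = {
    1: {}, 2: {1: 0.5}, 4: {1: 0.5}, 5: {1: 0.5},
    7: {1: 0.25, 5: 0.5}, 11: {1: 0.25, 5: 0.5},
    13: {1: 0.5}, 15: {},
}
print(build_control_dependence(inherited, order))
```

Only the identities and ranks of the origins are consulted; the magnitudes of the surviving potentials play no part once cancellation has been completed.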

Example 4

Reference is now made to FIG. 13, which is a composite diagram illustrating construction of a control dependence graph in accordance with the methods disclosed with reference to FIG. 11 and FIG. 12, in accordance with a disclosed embodiment of the invention. Exemplary source code 236 maps to control flow graph 186. In the notation used for control dependence graph 218, nodes are distinguished from their corresponding equivalents in the control flow graph 186 by a suffix “d”.

The assignment of nodal potentials is now described with reference to the control flow graph 186.

Node 1 is visited first (step 1). It has no ancestors, and is assigned a potential value of 1.0 (Table 1, Row 1).

TABLE 1

Status Row   Node   Source   Potential               Remarks
1            1      1        1
2            2      1        0.5
3            2      2        1
4            4      1        0.5
5            4      4        1
6            5      1        0.5
7            5      5        1
8            7      1        0.25
9            7      5        0.5
10           7      7        1
11           11     1        0.25
12           11     5        0.5
13           11     11       1
14           13     1        0.5
15           13     13       1
16           13     7        0.5 + 0.5 (canceled)
17           15     1        0.5 (canceled)          Rcvd directly from Node 1
18           15     1        0.5 (canceled)          Rcvd via Node 13
19           15     15       1

Since node 1 has two children (node 2 and node 15), it divides its potential among them. Thus, in step 2, node 2 and node 15 each inherit a potential value of 0.5 from node 1. Node 15 is discussed below. Node 2 has inherited potential of 0.5 (Table 1, Row 2) and innate potential of 1.0 (Table 1, Row 3). Node 2 has one child node, node 4.

Now node 4 is considered. The two potentials derived from node 2 are treated separately. It will be recalled from the discussion of Rule 5 that node 4 inherits the innate potential of node 2, but since it equals one, it is canceled. This transaction is omitted from Table 1. Node 4 has received from node 2 an inherited potential of 0.5 derived from its remote ancestor, node 1 (Table 1, Row 4). Additionally, it has innate potential 1.0 (Table 1, Row 5).

Node 4 propagates 100% of its inherited potential to node 5, its only child node (Table 1, Row 6). Node 5 also has innate potential 1.0 (Table 1, Row 7).

Node 5 has two children, nodes 7, 11, and distributes its potentials among them in accordance with Rule 2. Node 7 is described first. In a first transaction, 50% of the inherited potential of node 5 (Table 1, Row 6), value 0.25 and deriving from node 1, is propagated to node 7 (Table 1, Row 8). In a second transaction, 50% of the innate potential of node 5 (Table 1, Row 7), value 0.5, is propagated to node 7 (Table 1, Row 9). Node 7 has an innate potential, value 1.0 (Table 1, Row 10).

Node 13 is now visited. The order of visitation of the nodes in the control flow graph 186 is not critical, and the particular order detailed herein is exemplary. However, it is apparent that the requisite information required from one of its parents, node 11, has not yet been determined. Node 13 cannot presently be evaluated, and is deferred.

Node 11 is now visited and evaluated. The details are identical to node 7 and are not repeated in the interest of brevity.

Node 13 is reconsidered. It receives identical distributions of first inherited potentials from node 7 (Table 1, Row 8) and node 11 (Table 1, Row 11), each value 0.25. Both of these are originally derived from node 1. They are combined for convenience in one row (Table 1, Row 14). Node 13 has innate potential, value 1.0 (Table 1, Row 15). In another transaction, node 13 also receives identical second inherited potentials from node 7 (Table 1, Row 9) and node 11 (Table 1, Row 12), each having value 0.5. The second inherited potentials are derived from their common parent, node 5. They total 1.0, and are therefore canceled in accordance with Rule 5 (Table 1, Row 16).

The last node to be considered is node 15. In a first transaction, 50% of the innate potential of one of its parents, node 1, value 0.5, is propagated to node 15 (Table 1, Row 17). In a second transaction, inherited potential held in the other parent, node 13 (Table 1, Row 14), which also originated from node 1, is propagated to node 15 (Table 1, Row 18). As the two inherited potentials of node 15 total 1.0, they are canceled in accordance with Rule 5. Node 15 is left with innate potential, value 1.0 (Table 1, Row 19).

Construction of the control dependence graph 218 is now described:

Node 1 only has innate potential, value 1.0. It is shown as node 1d in the control dependence graph 218, and, in accordance with Rule 7, depends on node “Enter”.

Node 2 has one inherited potential (Table 1, Row 2) deriving from node 1. Consequently, node 2d depends on node 1d, in accordance with Rule 9.

Node 4 has one inherited potential (Table 1, Row 4) deriving from node 1. Consequently, node 4d also depends on node 1d.

Node 5 has one inherited potential (Table 1, Row 6). Therefore, node 5d also depends from node 1d.

Node 7 has two inherited potentials (Table 1, Rows 8, 9), derived from node 1 and node 5. Rule 8 now applies. Referring to the control flow graph 186, node 5 has a greater topological order than node 1. Therefore, node 7d depends on node 5d. In like manner, node 11d depends from node 5d.

Node 13 has one remaining inherited potential (Table 1, Row 14), originating from node 1. It may be noted that the cancellation of the two inherited potentials originating from node 7 (Table 1, Row 16) eliminates node 7d from consideration as a candidate for dependency. Node 13d therefore depends from node 1d in accordance with Rule 9.

Node 15 has only innate potential, value 1.0 (Table 1, Row 19), its inherited potentials (Table 1, Rows 17, 18) having been canceled. Node 15d therefore depends on node “Enter” in accordance with Rule 7.

In preparing the control flow graph 186 and the control dependence graph 218, each node is evaluated only once. Storing the solution in a table such as Table 1, e.g., a hash table, yields a total efficiency of O(n), where n is the number of nodes in the graph.

System Dependence Graphs

In order to follow data flow and control dependence through entire systems, the graphs are linked together. Referring again to FIG. 2, this process is performed in layer 58, using CDG+DFG module 60. A system dependence graph (SDG) can be regarded as a larger, application-encompassing control flow graph. The system dependence graph has the same properties as the control flow graph, except that instead of creating stubs as in the case of the CFG module 48, the CDG+DFG module 60 adds edges to the single method control flow graph of the called method. As a method may be called more than once, codes are associated with invocations and returns, e.g., color properties. These are expedient to direct invocations and returns in the graph in a desired order. The term “color” used herein is arbitrary to indicate an index to a particular invocation or return. Such codes may be implemented in many ways. Indeed, the graphs are generally not actually displayed, except for purposes of explication. In cases of polymorphism, all possible paths are constructed.

In order to follow data flow and control dependence through an entire application, it is necessary to link single method graphs. In the case of a control dependence graph, a link is established from the invoking node of the calling method to the entrance node of the method being invoked. This implies that every source code statement in the invoked method has a control dependence on the invoking statement.
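The linking of single method control dependence graphs may be sketched as follows. This is a minimal illustration only, not the CDG+DFG module 60 itself; the dictionary-based graph, the node labels, and the function names link_invocation and dependents are assumptions introduced for the example.

```python
from collections import defaultdict


def link_invocation(cdg, invoking_node, callee_entry):
    """Add a control dependence edge from the invoking node to the callee's entry node."""
    cdg[invoking_node].add(callee_entry)


def dependents(cdg, node):
    """Return every node that is directly or transitively control dependent on `node`."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in cdg.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical single method graphs; an edge points from a node to the nodes that depend on it.
cdg = defaultdict(set)
cdg["func1:enter"].add("func1:call func2")
cdg["func2:enter"].update({"func2:stmt1", "func2:stmt2"})

# Link the invoking statement in func1 to the entrance node of func2.
link_invocation(cdg, "func1:call func2", "func2:enter")
linked = dependents(cdg, "func1:call func2")
```

After linking, every statement of the hypothetical func2 appears among the dependents of the invoking statement, which is the property stated above.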

Data flow graphs are more complicated, especially when using object-oriented languages. Three issues need to be confronted:

First, in object-oriented languages, parameters may contain several data members, each of which may itself recursively incorporate other data members. In order to trace data flow it is necessary to treat each data member and component individually. In practice, a simple function that receives a single parameter may require an expansion of the data structure, so that many parameters may be processed in the data flow graphs.

Second, in object-oriented languages, a “THIS” object exists, which refers to the object that is currently active. Information concerning the “this” object has to flow between method invocations to correctly describe data flow. The issue is resolved by treating the “this” object as the first parameter to each called method.

Third, global variables present another complication, as they can be accessed from virtually everywhere in the application. This is an exception to the hierarchical behavior of object-oriented programming. It is dealt with by defining a “super-global” variable that is passed as a parameter to all methods. Global variables are assigned as data members of the super-global variable. When the super-global variable is expanded along with other parameters, the global variables therein are also passed to the called method.

Listing 5 illustrates handling of all three issues. At first, it seems that only one parameter is passed to the function func( ):

Public void func(myClass ins).

First, the THIS object and Super-Global variable are added. Now the function appears as follows:

Public void func(THIS, SuperGlobal, myClass ins).

Second, the data members of each parameter are expanded. The THIS object contains one data member (var3), the Super-Global variable contains one data member (Session[“Hello”]), and “ins” has two data members (var1, var2). After expansion, the function appears as follows:

Public void func(THIS, THIS.var3, SuperGlobal, SuperGlobal.Session-Hello, myClass ins, ins.var1, ins.var2).
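The two expansion steps may be sketched as follows. The sketch is illustrative only; the nested-dictionary description of data members and the names expand and expand_signature are assumptions and do not appear in Listing 5.

```python
def expand(name, members):
    """Return `name` followed by its data members, flattened recursively.

    `members` maps each data member name to a dict of its own data members
    (a hypothetical layout chosen for this sketch).
    """
    result = [name]
    for member, sub_members in members.items():
        result.extend(expand(f"{name}.{member}", sub_members))
    return result


def expand_signature(this_members, global_members, declared):
    """Prepend THIS and SuperGlobal, then expand the data members of every parameter."""
    params = [("THIS", this_members), ("SuperGlobal", global_members)] + declared
    expanded = []
    for name, members in params:
        expanded.extend(expand(name, members))
    return expanded


signature = expand_signature(
    this_members={"var3": {}},
    global_members={"Session-Hello": {}},
    declared=[("ins", {"var1": {}, "var2": {}})],
)
```

Because expand is recursive, a data member that itself contains data members would be flattened in the same manner.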

Reference is now made to FIG. 14, which diagrammatically illustrates stub replacement in a control flow graph, in accordance with a disclosed embodiment of the invention, based on source code 239. After expanding the parameters, relevant nodes of one single method control flow graph are linked to the prolog of another single method control flow graph. A similar link is established from the epilog to return values. In FIG. 14, the calling function, func1( ), is shown as a column of nodes 238 on the left, and the invoked function, func2( ), as a column of nodes 240 on the right. Edges 242, 244 link the functions at the points of invocation and return, respectively. As noted above, colors of such links are assigned by the CDG+DFG module 60 (FIG. 2) for convenience of the operator. The linking process may be iterated to create large, application-encompassing graphs, such as a system dependence graph.

DOM Operations

It is desirable to store the document object model in an object-oriented database. Suitable databases for this purpose include the model db4o, available from db4objects, Inc., 1900 South Norfolk Street, Suite 350, San Mateo, Calif., 94403 and the Versant™ Object Database, available from Versant Corporation, 255 Shoreline Drive, Suite 450, Redwood City, Calif. 94065.

Advantages of this approach include rapid storage and retrieval of the document object model, thereby avoiding the need for its recalculation. Database storage enables querying the source code for static characteristics, e.g., using query languages such as OQL. Furthermore, automatic updating of code can sometimes be accomplished with the aid of an object database.

Referring again to FIG. 2, layer 62 includes the object oriented database 64. Listing 6 illustrates the process of storage and retrieval using the database 64. Listing 7 is an example of manipulating the document object model, in which public data members are changed to private data members.

Code Graph Querying

The preceding description concerns development of raw information about the source code. In order to transform the information into workable knowledge, some data mining is required. There are two ways to fulfill this requirement:

The first method is to use hard-coded customized functions. Once developed, such functions are easy to use, but they are inflexible, and difficult to adapt to particular applications or local user requirements.

Alternatively, one can employ a query language. This language is flexible enough to retrieve any static and dynamic knowledge from the data that might be needed. However, to use it effectively, scripting skills are required on the part of the user.

The SCA engine 22 employs a query language that has been extended by specialized built-in functions. This has the advantages of both methods: it is easy to use on one hand, and highly configurable on the other. An expert user can tailor the queries to his specific needs, or even write queries from scratch, whereas a novice has only to “point and click”.

The scripts developed by the query language can be used in order to perform code slicing, either syntax preserving or semantic preserving. Program slicing is a technique for aiding debugging and program comprehension by reducing complexity. The essence of program slicing is to remove statements from a program that do not affect the values of variables at a point of interest. Program slicing is well known in the art.

Example 5

This example displays code slicing using the code fragment of Listing 8. It is desired to learn what influences the Write statement in line 4. The code is analyzed or “backward sliced”, preserving syntax. The “slice” is computed by working backwards from the point of interest, finding all statements that can affect the specified variables at the point of interest and discarding the other statements. In “syntax preserving” slicing, the syntax of the original program is largely untouched. Irrelevant statements are simply removed to create a program slice.

The statements “a++” (line 3) and “a=3” (line 1) are obviously relevant. The resulting slice is shown in Listing 9, in which omitted code is indicated by a dashed line. However, the result does not compile correctly. In line 3 of Listing 9, the value of variable “b” is set, but variable b is never declared. This fragment illustrates a drawback of using pure syntax preserving slicing: a statement may contain a mixture of relevant and irrelevant expressions, in which case the result does not compile.

A solution is to use a technique known as “semantic preserving slicing”, in which only semantics-preserving transformations are allowed. This is achieved by splitting blocks in the control flow graph into atomic elements, each of which represents a single action. Applying this technique results in the code fragment of Listing 10.
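A backward slice over atomic elements may be sketched as follows. The fragment below is hypothetical and is not Listing 8; it merely resembles the description above in that line 3 mixes a relevant action with an irrelevant one. The tuple layout (line, text, defined variables, used variables) and the function name backward_slice are assumptions, and the sketch handles straight-line code only.

```python
def backward_slice(atoms, criterion_index):
    """Return the atomic elements that can affect the variables used at the criterion.

    Each atom is (line, text, defs, uses). Only straight-line code is handled:
    the atoms are walked backwards while tracking the variables still needed.
    """
    needed = set(atoms[criterion_index][3])
    kept = [criterion_index]
    for index in range(criterion_index - 1, -1, -1):
        _, _, defs, uses = atoms[index]
        if defs & needed:
            kept.append(index)
            # The definition satisfies the need; its own inputs become needed.
            needed = (needed - defs) | uses
    return [atoms[i][1] for i in sorted(kept)]


# Hypothetical fragment, already split into atomic elements.
atoms = [
    (1, "a = 3", {"a"}, set()),
    (2, "b = 0", {"b"}, set()),
    (3, "a++", {"a"}, {"a"}),
    (3, "b = 2", {"b"}, set()),   # shares line 3 with a relevant action
    (4, "Write(a)", set(), {"a"}),
]

sliced = backward_slice(atoms, criterion_index=4)
```

Since line 3 has been split into two atomic elements, the irrelevant assignment to b is dropped without leaving a dangling reference to an undeclared variable.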

The query language of the current embodiment contains the commands shown in Table 2, in which X and Y are arrays of objects.

TABLE 2
X.DataInfluencedBy(Y) | All objects of X that are data influenced by objects from Y.
Y.DataInfluencingOn(X) | All objects of Y that are data influencing on objects from X.
X.ControlInfluencedBy(Y) | All objects of X that are control influenced by objects from Y.
Y.ControlInfluencingOn(X) | All objects of Y that are control influencing on objects from X.
X.InfluencedBy(Y) | All objects of X that are influenced by objects from Y (either data or control).
Y.InfluencingOn(X) | All objects of Y that are influencing on objects from X (either data or control).
X.ExecutesBefore(Y) | All objects of X that are executed before any of the objects in Y.
Y.ExecutesAfter(X) | All objects of Y that are executed after any of the objects of X.
X.FindByID(n) | All objects of X whose id is n (id is a unique identifier that each object in the system has; this implies that the returned array may contain at most one object).
X.FindByName(s) | All objects of X whose name is or contains s (supports wildcards).
X.FindByLocation(loc) | All objects of X that are located in the specified location (line, row).
X.FindByType(typeof(t)) | All objects of X whose DOM object is of type t (for example, find all field declarations).
X.FindByType(t) | All objects of X that are of type t (for example, find all int variables).
X.FindByQuery(q) | All objects of X that match a specific query (see section 10 above).
X − Y | All objects of X that are not in Y.
X + Y | All objects of X together with all the objects in Y.
X * Y | All objects of X that are also in Y.
X/Y | All objects of X that are not in Y together with all objects of Y that are not in X.
X.DataInfluencedByAndNotSanitized(Y, Z) | All objects of X that are data influenced by Y, and there is a path between Y and X that doesn't go through Z (see section 15 below).
All.DirectlyDataInfluencingOn(X) | The objects that directly affect the value of any of the objects in X.
All.DirectlyDataInfluencedBy(X) | The objects that are directly affected by the value of the object in X.
Chopping | Notice that chopping is exactly like InfluencedBy * InfluencingOn.
All.InfluenceByAndNotSanitized(X, Y) | The objects that are affected by X, in a path that doesn't contain Y.
All.CallingMethodOfAny (X) | Objects calling to one of these methods.
All.GetClass (X) | Get the class containing object X.
All.GetByClass (X) | Get all objects contained in class X.
All.FindByMemberAccess (string) | Find all access to the specified member.
All.FindByAssignmentSide(AssignmentSide) | Find all objects on the specified side of an assignment expression.

Using the commands in Table 2, any type of dependence (data, control) or execution (control flow), in any order (By, On) can be calculated easily.

Example 6

The following query, using the commands shown in Table 2, reveals the effect on an application of changing a Boolean value from true to false:

Result=All.InfluencedBy (All.FindByName (“namespace1.class1.bool1”)).

In order to find all locations where data is influenced by variable A or variable B, but not both, use the following query:

Result=All.DataInfluencedBy (A)/All.DataInfluencedBy (B);
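The set operators of Table 2 correspond to ordinary set operations, as the following sketch suggests. Python sets stand in for the arrays of objects; the data flow edges, the node names, and the helper influenced_by are hypothetical and are not part of the query language.

```python
def influenced_by(edges, start):
    """Forward closure: every node reachable from `start` along data flow edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target in edges.get(node, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


# Hypothetical data flow edges between variables A, B and four statements.
edges = {"A": ["s1", "s2"], "B": ["s2", "s3"], "s2": ["s4"]}

by_a = influenced_by(edges, "A")
by_b = influenced_by(edges, "B")

difference = by_a - by_b      # X − Y: in X but not in Y
union = by_a | by_b           # X + Y: in X or in Y
both = by_a & by_b            # X * Y: in X and in Y
exactly_one = by_a ^ by_b     # X/Y: in exactly one of X and Y
```

Here exactly_one plays the role of the query above: it holds the locations influenced by A or by B, but not by both.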

In order to find all locations that influence object #3 and are influenced by object #5, queries can be chained:

Result=All.InfluencingOn(All.FindByID(3)).InfluencedBy(All.FindByID(5)).

Query Implementation

Much of the computational effort in servicing queries involves searching for specific objects in large graphs. Various methods are employed to service the query, particularly those listed in Table 2. These methods generally involve searches for different types of objects. A common search method returns the forward/backward closure from a specified location. Each method involved in a particular query then parses the closure results. In order to avoid infinite loops, the same node is not visited more than once, unless distinguished by a different color property, as explained above.

Reference is now made to FIG. 15, which shows exemplary control dependence graphs 246, 248, 250, illustrating use of a “leapfrog closure computation” algorithm, in accordance with a disclosed embodiment of the invention.

In FIG. 15, as best seen on the graph 250, edges 252, 272 (red) and edges 254, 256 (green) are given distinctive properties because they show invocations and returns to and from functions. Edges 258, 260 (blue) are given color properties for the same reason.

The traversals are marked or “painted” according to the following rules:

Rule 10. A graph is initially unmarked.

Rule 11. An unmarked section of a graph may be marked or “painted”, denoting that a stub has been replaced by a section leading to and from the source code of a function.

Rule 12. A marked section of a graph may be skipped in a subsequenttraversal.

It is sufficient to mark or paint only boundary portions of the respective sections, it being assumed that intermediate portions are also marked. In stack-based implementations, such boundary markings correlate with push and pop operations.

Referring first to graph 246, a first traversal during a search operation or closure computation follows a path from node 262 (a) to node 264 (e). Most of the graph is shown unmarked. However, during a first traversal node 262 has been reached. Node 262 is an entry point to some function in the source code. Edge 252 has been painted “red” in accordance with Rule 11. In preparation for marking a matched section when a return from the function occurs, a property “red” is pushed onto a stack.

Referring next to graph 248, the traversal passes through a section bounded by node 266 and node 268. These nodes indicate invocation and return from another function. At node 266, a property “blue” is pushed onto the stack.

Upon exiting node 268 the property “blue” is popped from the stack, correlating with the blue coloration of edge 260 and node 268. Now the property “red” is again at the top of the stack.

At a branch at node 270, at which point a return from the first function occurs, edge 272 is painted red. The property “red” is popped from the stack. The traversal of the path leading from node 262 to node 264 (e) is then completed uneventfully.

Eventually a second traversal follows a path leading from node 274 (b) to node 276 (f). At edge 254, which is initially unmarked in accordance with Rule 10, the first function is invoked again, from a different location, and the property “green” is pushed onto the stack. The treatment is the same as for edge 252, discussed above, but the properties of edge 254 and edge 252 are distinguishable.

Now the traversal reaches node 266. However, from the record of a previous traversal through node 266, edge 258, node 278, and node 280, a call to the second function is again recognized. All computations associated with the call and return to the second function are known, and the segment is skipped in accordance with Rule 12, as indicated by a broken line 282.

Such “contractions” of the graph enable a subsequent traversal of a path to skip or leapfrog previously marked sections of the graph, possibly representing large sections of code. A contraction between nodes 266, 268 is established, including nodes 280, 278 and their incoming and outgoing edges. Now, when it is attempted to traverse the graph a second time, following a path from node 274 to node 276, the section delineated by edges having the property “blue” is skipped, and is not seen. As this section already appears in the closure, no information is missed. Rather, the computation is accelerated by avoiding sections of code in the leapfrog operation. The second traversal follows broken line 282.

Upon exiting node 270, corresponding to a return from the second invocation of the first function, the property “green” is popped from the stack, and edge 256 is painted green. Should a subsequent traversal (not shown) involve a path leading through edges 254, 256, another contraction, denoted by nodes 284, 286, would be executed, which would be even larger than the contraction denoted by line 282.
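The push and pop behavior of the color properties may be sketched as follows. The sketch models only the matching of invocation and return colors on a stack; the contraction of previously marked sections (Rule 12) is not modeled. The edge encoding (target, kind, color), the small graph, and the name colored_closure are assumptions, and the sketch assumes non-recursive calls so that the stack depth is bounded.

```python
def colored_closure(edges, start):
    """Return the nodes reachable from `start` when every return edge must match
    the color pushed by the most recent unmatched invocation edge."""
    seen = {(start, ())}
    worklist = [(start, ())]
    while worklist:
        node, stack = worklist.pop()
        for target, kind, color in edges.get(node, ()):
            if kind == "call":
                new_stack = stack + (color,)      # push the invocation color
            elif kind == "return":
                if not stack or stack[-1] != color:
                    continue                      # return belongs to another call site
                new_stack = stack[:-1]            # pop the matching color
            else:
                new_stack = stack
            state = (target, new_stack)
            # A node is revisited only when the pending colors differ.
            if state not in seen:
                seen.add(state)
                worklist.append(state)
    return {node for node, _ in seen}


# Hypothetical graph: one function invoked from "a" (red) and from "b" (green).
edges = {
    "a": [("entry", "call", "red")],
    "b": [("entry", "call", "green")],
    "entry": [("body", "plain", None)],
    "body": [("exit", "plain", None)],
    "exit": [("e", "return", "red"), ("f", "return", "green")],
}

from_a = colored_closure(edges, "a")
from_b = colored_closure(edges, "b")
```

Without the color match, a traversal starting at “a” would also reach “f” through the shared function body; the stack keeps the two invocations distinguishable.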

Vulnerability Queries

Referring again to FIG. 2, layers of the SCA engine 22 described above provide an infrastructure for code querying. The following sections describe application of these layers for detection of code vulnerabilities.

The following vulnerabilities and issues can be detected, as well as others not listed: unvalidated input; persistent attack; least privilege; logical flaws; pages without graphical user interface (GUI) access; display of confidential information; I/O from virtual directory; data validation issues; broken access control; protection methodology; and source sensitive wizard.

Unvalidated Input

Unvalidated input points provide attackers with entry points to an application. An application without entry points, i.e., one that does not receive any input from users, is not likely to be attacked. Input validation is used to verify that input entered by the user complies with predetermined rules, an issue that software developers sometimes ignore or fail to implement properly.

The SCA engine 22 (FIG. 2) uses data flow graphs to locate input sites lacking validation. For this purpose, nodes of the data flow graph are assigned to one of three categories, denoted as [1], [2], [3]. Entry points in the application are assigned to category [1]. Such points potentially contain unsafe input. Category [2] corresponds to input validation functions, which typically sanitize input data. Category [3] applies to places where the data is consumed. It is at the places classified as category [3] that the SCA engine 22 verifies that only acceptable data is processed.

Reference is now made to FIG. 16, which is a flow chart of a method for identifying a possibility of unvalidated input in a computer program, in accordance with a disclosed embodiment of the invention. The method shown below involves closure of a flow graph following removal of some nodes. Applying closure in this manner increases the efficiency of graph traversals, in that conditional statements need not be evaluated. At initial step 288, a data flow graph is prepared, for example as described above with reference to FIG. 8. A traversal of the graph is now begun, applying the “leapfrog” method described above.

At step 290, a node is selected and categorized as described above.

Control now proceeds to decision step 292, where it is determined if the current node represents an input validation function (category [2]).

If the determination at decision step 292 is affirmative, then control proceeds to step 294. The current node and its incoming and outgoing edges are removed from the data flow graph.

After performing step 294, or if the determination at decision step 292 is negative, control proceeds to decision step 296, where it is determined if there are more nodes to be processed in the data flow graph. If the determination at decision step 296 is affirmative, then control returns to step 290.

If the determination at decision step 296 is negative, then the first phase of the procedure has been completed. Only nodes categorized [1] or [3] remain in the data flow graph.

Control proceeds to step 298. A node of category [1] is selected.

Next, at step 300 an edge leading away from the current node is chosen.

Control now proceeds to decision step 302, where it is determined if the current edge extends to a node where input is used (category [3]). If the determination at decision step 302 is affirmative, then control proceeds to step 304. The current path is classified as unsafe.

After performing step 304, or if the determination at decision step 302 is negative, then control proceeds to decision step 306, where it is determined if there are more edges leading from the current node.

If the determination at decision step 306 is affirmative, then control returns to step 300.

If the determination at decision step 306 is negative, then control proceeds to decision step 308, where it is determined if there are more category [1] nodes in the data flow graph. If the determination at decision step 308 is affirmative, then control returns to step 298, where a new node is chosen.

If the determination at decision step 308 is negative, then the data flow graph has been fully evaluated. Control proceeds to final step 310, and the procedure ends.

In an alternate implementation of the method, nodes of category [3] may be selected at step 298 and connections between category [1] nodes and category [3] nodes determined by evaluating edges leading into the category [3] nodes.
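The two phases of the method of FIG. 16 may be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the SCA engine 22: the data flow graph is a plain dictionary, the node names are hypothetical, and the second phase uses reachability (closure) from each category [1] node rather than inspecting single edges, so that uncategorized intermediate nodes are tolerated.

```python
SOURCE, SANITIZER, SINK = 1, 2, 3


def find_unsafe_paths(edges, category):
    """Return (source, sink) pairs still connected after sanitizer nodes are removed."""
    # Phase 1: remove every category [2] node with its incoming and outgoing edges.
    pruned = {
        node: [t for t in targets if category.get(t) != SANITIZER]
        for node, targets in edges.items()
        if category.get(node) != SANITIZER
    }
    # Phase 2: from each category [1] node, follow the remaining edges to category [3] nodes.
    unsafe = []
    for source in sorted(n for n, c in category.items() if c == SOURCE):
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            for target in pruned.get(node, ()):
                if target in seen:
                    continue
                seen.add(target)
                if category.get(target) == SINK:
                    unsafe.append((source, target))
                stack.append(target)
    return sorted(unsafe)


# Hypothetical graph: input1 is sanitized before use, input2 is not.
edges = {
    "input1": ["sanitize"],
    "sanitize": ["query1"],
    "input2": ["query2", "write"],
}
category = {
    "input1": SOURCE, "input2": SOURCE,
    "sanitize": SANITIZER,
    "query1": SINK, "query2": SINK, "write": SINK,
}

unsafe = find_unsafe_paths(edges, category)
```

Removing the sanitizer node leaves the edge out of input1 ending blindly, so only the paths starting at input2 are reported.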

Example 7

Reference is now made to FIG. 17, which illustrates processing of data flow graphs to determine unvalidated input vulnerabilities in accordance with a disclosed embodiment of the invention. Data flow graphs 312, 314 correspond to code 316. Nodes of categories [1], [2], [3] are shown.

When the method of FIG. 16 is performed, category [2] node 318 is discovered, and removed, as shown in graph 314. In graph 312, edge 320 connects category [1] node 322 to node 318. In graph 314 it is apparent that edge 320 ends blindly.

When the sequence beginning with step 298 (FIG. 16) is performed, edges 324, 326 are found to connect category [1] node 328 with category [3] nodes, and are therefore reported as unsafe. Edge 320 is not reported as being unsafe. It is concluded that category [1] node 328 constitutes a security vulnerability. As shown in Table 3, modifications of this technique allow vulnerabilities involving several types of injections to be discovered. Table 3 illustrates categorization of node types relating to respective forms of injection.

TABLE 3
Type | [1] | [2] | [3]
SQL Injection | Interactive inputs | Sanitization function (e.g., prepared statements) | DB access commands
Cross site scripting | Interactive inputs | Sanitization function (e.g., HTMLEncode) | Web screen output
Command Injection | Interactive inputs | Sanitization function (e.g., removing meta-characters) | Operating system direct access commands
LDAP Injection | Interactive inputs | Sanitization function (e.g., removing LDAP meta-characters) | LDAP access command
Reflection injection | Interactive inputs | Sanitization function (e.g., global variables removal) | Reflection commands
Path manipulation | Interactive inputs | Sanitization function (e.g., path meta-characters removal) | File access commands

Persistent Attacks

Persistent attacks occur in two stages. In the first stage, the attacker stores a dangerous payload on the server. The second stage, typically deferred, causes the payload to execute. Deferral of the effect makes it very difficult to locate the vulnerability manually. The method described with respect to FIG. 16 and FIG. 17 is capable of finding such vulnerabilities when modified by categorizing storage functions (instead of entry points) as category [1].

By modifications that will be evident to those skilled in the art, by retrieving data directly from a database instead of dealing with interactive inputs as in the discussion of unvalidated input, the method is capable of detecting the following vulnerabilities: second order SQL injection; persistent SQL injection; intersystem attacks; and persistent cross-site scripting attacks (XSS attacks).

Example 8

Consider the code fragment of Listing 11. The code queries a database for the name of the person with an id of 3. Then, in a second query, it obtains his rank based on the retrieved name. Even if the name was sanitized before it was written to the database, e.g., by enclosure in double quotes, a single quoted name, e.g., (“O'Brian”) will be retrieved from the database. Depending on the nature of the application, and the manner in which the stored data is rendered or executed, the second query is subject to a form of attack, which is sometimes referred to as “Second Order SQL Injection”.
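The effect can be reproduced with the following sketch, which is not Listing 11. It uses Python's standard sqlite3 module with an in-memory database; the table and column names (people, ranks, person_rank) are assumptions chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE ranks (name TEXT, person_rank TEXT)")

# First stage: the name is stored safely through parameterized statements.
conn.execute("INSERT INTO people VALUES (?, ?)", (3, "O'Brian"))
conn.execute("INSERT INTO ranks VALUES (?, ?)", ("O'Brian", "captain"))

# The stored value is retrieved with its apostrophe intact.
name = conn.execute("SELECT name FROM people WHERE id = 3").fetchone()[0]

# Second stage: concatenating the retrieved value into a new query breaks the SQL text.
try:
    conn.execute("SELECT person_rank FROM ranks WHERE name = '" + name + "'")
    concatenation_succeeded = True
except sqlite3.OperationalError:
    concatenation_succeeded = False

# Binding the retrieved value as a parameter avoids the problem.
rank = conn.execute(
    "SELECT person_rank FROM ranks WHERE name = ?", (name,)
).fetchone()[0]
```

In this sketch the apostrophe merely causes a syntax error; a crafted stored value could instead alter the meaning of the second query, which is why data retrieved from the database is treated as category [1] in the modified method.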

Least Privilege

Least privilege is a well-known term in IT security. The idea behind it is that an entity, whether a user, an application or a service, should have only the privileges needed to make it work correctly, and nothing more. Although the idea is simple, its implementation is labor intensive. This difficulty is alleviated by using the query language described above with reference to Table 2 to identify privileges and automatically create a configuration file that specifies such privileges.

Scanning the code and denying access to program objects to which access is not needed by the application or by its authorized users prevents unauthorized use as well.

Example 9

The following statement is an entry in a least privilege configuration file, which removes access permissions to the “xp_cmdshell” stored procedure. Such access permissions constitute a vulnerability that may allow remote access to database servers.

If (!Code.Execute(SQLStoredProcedure(“xp_cmdshell”))) SQLScript.Add(RemovePermissions(“xp_cmdshell”))

The above query creates a .SQL configuration file, containing the content exec sp_dropextendedproc ‘xp_cmdshell’.

This removes the stored procedure xp_cmdshell.
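Generation of such a configuration file may be sketched as follows. The function name build_least_privilege_script and both input sets are hypothetical; in practice the set of procedures actually called would come from a query over the code graph.

```python
def build_least_privilege_script(called_procedures, dangerous_procedures):
    """Emit one drop statement for each dangerous stored procedure the code never calls."""
    return [
        f"exec sp_dropextendedproc '{procedure}'"
        for procedure in sorted(dangerous_procedures)
        if procedure not in called_procedures
    ]


script = build_least_privilege_script(
    called_procedures={"sp_get_orders"},      # hypothetical result of a code query
    dangerous_procedures={"xp_cmdshell"},
)
```

A procedure that the application does call is left in place, so the generated script removes only privileges that are not needed.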

Least Privilege

The well known file system NTFS (NT File System) allows permissions to be defined for specific files and folders. By querying the code, using the above described query language, it is possible to learn which files and folders are accessed by the application, and what kind of access is needed. Anything else can be denied. The SCA engine 22 (FIG. 2) presents a dialog box that allows an operator to manually configure permissions. The same approach can be applied to file systems other than NTFS.

Logical Flaws

Logical flaws are unique to a specific application. These are coding errors that do not comply with the application's specification. Such flaws can be detected using the above-described query language combined with the SCA engine 22 (FIG. 2).

This technique exposes many types of vulnerabilities that stem from logical flaws, for example flaws that violate the business logic as specified for the application. One obvious example is the display of confidential information, such as passwords, credit card numbers, and social security numbers. Other examples include forgotten debugging code, orders with negative quantity, and backdoors.

Example 10

The following statement is a query that was executed on an open-source bookstore, in order to find a logical vulnerability wherein a user lacking administrative privileges is allowed to see another user's orders:

Result=FindPlacesWhere(OrdersRetrievedFromDB && pagePermission!=Administrator && dataNotInfluencedBy(userId)).

Pages Without GUI Access

Pages that are accessible from the Internet, but cannot be accessed from the UI, usually indicate use of the “security by obscurity” technique, that is, secrecy of design or implementation to provide security. This approach admits that an application may have security vulnerabilities, but relies on the belief that the flaws are not known, and that attackers are therefore unlikely to find them. Identification of this disfavored approach alerts the operator that the application may indeed have latent security vulnerabilities and indicates the need for particular scrutiny.

Example 11

The following query detects a vulnerability of the above-described type:

FindAllPages − FindUIPageAccessCommand.AccessedPage.

Display of Confidential Information.

Some variables should always be retrieved from the user, and never displayed, e.g., passwords, credit card numbers. In one vulnerability, “hidden” fields on a web page are displayable using a browser's “view source” option.

Example 12

The following query detects a vulnerability that would permit display of confidential information:

All.FindSensitiveVariables( ).DataInfluencingOn(Find_Outputs( )).

I/O From Virtual Directory.

I/O operations applied to a virtual directory may expose data, since a virtual directory, unless configured otherwise, is likely to enable read operations by all users.

Example 13

The following query detects this vulnerability:

Find_File_Access( ).NotDataInfluencedBy(AbsolutePath).

Data Validation Functions

Data validation functions are well known in the art. Despite their availability, a programmer may develop a proprietary input validation function. The SCA engine 22 (FIG. 2) employs software fault tree analysis (SFTA), a known technique. SFTA is discussed, for example, in the document A Software Fault Tree Approach to Requirements Analysis of an Intrusion Detection System, Guy Helmer et al. (2001), which was published on the Internet.

Referring again to FIG. 2, SFTA engine 68 in layer 66 verifies the competence of data validation functions that may be found in the code.

Reference is now made to FIG. 18, which diagrammatically illustrates processing of an exemplary proprietary data validation function in accordance with a disclosed embodiment of the invention. This occurs in the layer 66 (FIG. 2). The function, shown as source code 330, is intended to replace occurrences of the character “'” in a string s. Source code 330 contains a “for” loop, an “if” block, and an assignment statement.

Assume that prior to executing the source code, the string s contains an apostrophe in its Nth position. It is desired to determine if the apostrophe remains after completion of the “for” loop in the source code 330. We start by observing that there are three possible paths through the code:

1. The program does not enter the “for” loop.

2. The program enters the “for” loop but for some reason the “if” statement never evaluates as “true”.

3. The program enters both the “for” loop and the “if” block, but the assignment expression leaves the apostrophe in place.

These possibilities are shown in a graph 332. The third possibility, indicated by block 334 in the graph 332, is impossible and need not be considered further for purposes of the SCA engine 22.

Consider the option shown in block 336, corresponding to the first possible path. The “for” loop does not execute if (s.length−1)<1, or equivalently, if the length of string s is less than 2. In case “s” is a single-character string that contains only an apostrophe, the function will fail.

The second possible path is represented by block 338. Although the “for” loop has been entered, the “if” statement always returns “false”, even though the string contains an apostrophe at the Nth position. This will happen only if “i” never reaches “N”, which occurs if N<0 or (N>=s.length−1). In other words, the function will fail if an apostrophe occurs at the end of a string that exceeds one character in length.

The entire process is shown in a composite graph 340, in which two flaws 342, 344 are circled.

Automatic Unit Testing

Referring again to FIG. 2, by inserting a data validation function, for example the function shown in FIG. 17, into the unit testing engine 72 in layer 70 and testing it automatically, scenarios may be identified in which the data validation function fails. Using a conventional test generator, a function to be evaluated is embedded into a test application, and test cases are generated. When the test application executes, the outputs are presented to a validation engine, and the results can indicate a security vulnerability.
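Automatic testing of a data validation function may be sketched as follows. The function strip_apostrophes is a hypothetical reconstruction based on the loop bound discussed above with reference to source code 330, which is not reproduced here; the replacement character (a space), the exhaustive generator, and the names used are assumptions and do not represent the unit testing engine 72.

```python
from itertools import product


def strip_apostrophes(s):
    """Hypothetical flawed validation function: the loop stops one position short."""
    chars = list(s)
    for i in range(0, len(chars) - 1):    # flaw: the last index is never examined
        if chars[i] == "'":
            chars[i] = " "                # assumed replacement character
    return "".join(chars)


def failing_inputs(function, alphabet="a'", max_length=3):
    """Generate every string up to `max_length` and keep those left with an apostrophe."""
    failures = []
    for length in range(1, max_length + 1):
        for candidate in product(alphabet, repeat=length):
            text = "".join(candidate)
            if "'" in function(text):     # validation check on the output
                failures.append(text)
    return failures


failures = failing_inputs(strip_apostrophes)
```

Every failing input in this sketch ends with an apostrophe, including the single-character string, which corresponds to the two flaws identified by the fault tree analysis.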

Broken Access Control

In this vulnerability, restrictions on what authenticated users areallowed to do are not properly enforced. For example, attackers canexploit such vulnerabilities to access other user accounts, viewsensitive files, or use unauthorized functions. In locating suchvulnerabilities, queries can be designed, using the above-describedquery language, to locate pages that are called only when compliancewith certain criteria are required, e.g., user authorization, but whichare not checked during user interactions with such pages.

Example 14

In this example, a page named “/admin” is called only when the variable IsAdmin=1. However, the page itself does not check for that condition, and explicitly calling it will result in broken access control. The query found in the procedure shown in Listing 12 detects the vulnerability.
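The logic of the query in Listing 12 may be approximated by the following sketch, offered under stated assumptions. Conditions are modeled as plain strings, `call_sites` is a hypothetical mapping from each page to the sets of conditions guarding each place where the page is called, and `page_checks` is a hypothetical mapping from each page to the conditions the page itself verifies. None of these names belong to the query language.

```python
def find_broken_access(call_sites, page_checks):
    """Report conditions that guard every call to a page but are not
    re-checked inside the page itself."""
    findings = {}
    for page, guards in call_sites.items():
        if not guards:
            continue  # page is never called from analyzed code
        # Conditions common to all call sites (the "IntersectionIfs").
        common = set.intersection(*(set(g) for g in guards))
        missing = common - set(page_checks.get(page, ()))
        if missing:
            findings[page] = missing
    return findings


call_sites = {
    "/admin": [{"IsAdmin == 1"}, {"IsAdmin == 1", "LoggedIn"}],
    "/home": [set(), {"LoggedIn"}],
}
page_checks = {"/admin": set(), "/home": set()}

report = find_broken_access(call_sites, page_checks)
```

Here “/admin” is reported because every call to it is guarded by `IsAdmin == 1` while the page performs no such check; “/home” is not reported because no condition is common to all of its call sites.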

Automatic Discovery of Protection Methodology

Some of the queries mentioned above require the user to supply some information about the application, e.g., what function is used to sanitize input, or where key cryptographic information is stored. So-called “helping queries” can be used in order to find answers to these questions automatically. For example, a query that reveals the data access layer (DAL) methodology may help in the identification of SQL injection vulnerabilities without the need for the user to explicitly define the DAL methodology.
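One conceivable form of such a helping query is sketched below. It is only an illustrative heuristic, not the method of the SCA engine 22: it assumes a call graph is available as a mapping from each function to its callees, and it treats any function that directly invokes a known low-level database call as a candidate DAL method. The function names and the contents of `db_api` are hypothetical labels.

```python
def discover_dal(call_graph, db_api):
    """Return functions that directly invoke a low-level database call."""
    return sorted(
        fn for fn, callees in call_graph.items() if db_api & set(callees)
    )


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "LoginPage.Submit": ["UserRepo.Find", "Html.Encode"],
    "UserRepo.Find": ["SqlCommand.ExecuteReader"],
    "UserRepo.Save": ["SqlCommand.ExecuteNonQuery"],
    "Html.Encode": [],
}
# Assumed set of low-level database entry points.
db_api = {"SqlCommand.ExecuteReader", "SqlCommand.ExecuteNonQuery"}

dal_methods = discover_dal(call_graph, db_api)
```

The resulting list could then be supplied to an SQL injection query in place of a user-defined description of the DAL.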

Fine Tuning Issues

The SCA engine 22 (FIG. 2) assists the vulnerability discovery process using a source-sensitive wizard to develop queries. In addition to the basic information, each built-in query of the SCA engine 22 has conditions that determine whether the query should be executed. The wizard is source-code sensitive, in that it asks relevant questions to determine whether such conditions are satisfied for the particular source code. For example, if the application does not access a database, all database-related questions can be omitted.
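The relationship between built-in queries and their execution conditions may be pictured with the following sketch. The `Query` class, the fact names, and the two example queries are assumptions introduced for illustration and do not describe the actual data model of the SCA engine 22.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Query:
    name: str
    # Condition deciding whether the query is relevant to the source code.
    condition: Callable[[Dict[str, bool]], bool]


def applicable(queries: List[Query], facts: Dict[str, bool]) -> List[str]:
    """Return the names of the queries whose conditions are satisfied."""
    return [q.name for q in queries if q.condition(facts)]


queries = [
    Query("SQL injection", lambda f: f.get("uses_database", False)),
    Query("Cross-site scripting", lambda f: f.get("renders_web_output", False)),
]

# Facts the wizard might establish about the source code (hypothetical).
facts = {"uses_database": False, "renders_web_output": True}
selected = applicable(queries, facts)
```

Because the application in this example does not access a database, the database-related query is skipped and only the remaining query is selected.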

It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.

COMPUTER PROGRAM LISTINGS

Listing 1

  ...
  i = 5;
  ...
  i = 6;
  ...

Listing 2

  1. class myClass
  2. {
  3.   public int x;
  4. }
  5. class Run
  6. {
  7.   void foo()
  8.   {
  9.     int i,j;
  10.    myClass a = new myClass();
  11.    myClass b = new myClass();
  12.    i = 1;
  13.    j = 2;
  14.    a.x = 3;
  15.    b.x = 4;
  16.  }
  17. }

Listing 3

  class cs1
  {
    int i = 3;
    int j = 5;
    public cs1()
    {
      int a = 6;
    }
    public cs1(int p)
    {
      int a = p;
    }
  }

Listing 4

  class cs1
  {
    int i,j;
    private MetaConstructor()
    {
      i = 3;
      j = 5;
    }
    public cs1() : MetaConstructor()
    {
      int a = 6;
    }
    public cs1(int p) : MetaConstructor()
    {
      int a = p;
    }
  }

Listing 5

  public class myClass
  {
    public int var1;
    public int var2;
  }
  public class mySecondClass
  {
    public void func(myClass ins)
    {
      Session["Hello"] = "Information";
    }
    int var3;
  }

Listing 6

  DOM d = BuildDomFromFile(@"C:\code.cs");
  OODB db = Store(d);
  db.Select("Select CalledFunctionName, CallingFunctionName from MethodInvokes");

Listing 7

  OODB db = OpenDB(@"C:\Project1.db");
  db.Update("Update Field set Attributes = Private where Attributes = Public");
  DOM d = BuildDomFromDB(db);
  d.WriteSelf(@"C:\output.cs");

Listing 8

  1. int a = 3;
  2. int b = 5;
  3. b = a++;
  4. Write(a)

Listing 9

  1. int a = 3;
  ---------
  3. b = a++;
  4. Write(a)

Listing 10

  1. int a = 3;
  ---------
  3. a++;
  4. Write(a)

Listing 11

  a = "Select name from table where id = 3"
  r = "Select rank from table where name = '" + a + "'"

Listing 12

  foreach (PageAccess pa in PagesAccess)
  {
    Ifs[pa] = Conditions.ControlInfluencingOn(pa);
  }
  IntersectionIfs = Intersect(Ifs);
  if (!CurrentPage.Contains(IntersectionIfs))
  {
    _report_a_vulnerability_
  }

1-66. (canceled)
67. A computer-implemented method for evaluating a computer program, the method comprising: receiving, into a memory of a computer, source code of the computer program to be analyzed; constructing, using a source code analyzer running on the computer, a data flow graph (DFG) of the computer program; receiving in the computer a query, directing the computer to identify unvalidated inputs in the computer program; assigning the nodes in the DFG to three categories, consisting of a first category corresponding to entry points of input data, a second category corresponding to input validation functions, and a third category corresponding to locations where the input data is consumed; responsively to the query, traversing the DFG to identify the nodes in the second category, and removing the identified nodes in the second category and the edges connected to the identified nodes from the DFG; after removing the identified nodes and the edges, traversing the DFG to detect a node in the first category having an edge extending to one of the nodes in the third category; and reporting the detected node as an unvalidated input vulnerability.
68. The method according to claim 67, wherein constructing the data flow graph comprises: constructing an object-oriented model of the source code; using the model, constructing a control flow graph of the computer program; and deriving the data flow graph from the control flow graph.
69. The method according to claim 67, wherein the entry points correspond to interactive inputs to the computer program, and the input validation functions sanitize the interactive inputs.
70. The method according to claim 67, wherein the locations where the input data is consumed correspond to database access commands, and wherein the unvalidated input vulnerability comprises an SQL injection vulnerability.
71. The method according to claim 67, wherein the locations where the input data is consumed correspond to Web screen outputs, and wherein the unvalidated input vulnerability comprises a cross-site scripting vulnerability.
72. The method according to claim 67, wherein the locations where the input data is consumed correspond to operating system direct access commands, and wherein the unvalidated input vulnerability comprises a command injection vulnerability.
73. The method according to claim 67, wherein the locations where the input data is consumed correspond to LDAP access commands, and wherein the unvalidated input vulnerability comprises an LDAP injection vulnerability.
74. The method according to claim 67, wherein the locations where the input data is consumed correspond to reflection commands, and wherein the unvalidated input vulnerability comprises a reflection injection vulnerability.
75. The method according to claim 67, wherein the locations where the input data is consumed correspond to file access commands, and wherein the unvalidated input vulnerability comprises a path manipulation vulnerability.
76. The method according to claim 67, and comprising modifying the source code to remove the security vulnerability.
77. A data processing system, comprising: a memory, having program instructions stored therein; an I/O facility; and a processor accessing the memory to read the instructions, which cause the processor to receive, via the I/O facility, source code of a computer program to be analyzed, to construct, by analyzing the source code, a data flow graph (DFG) of the computer program, to receive a query, directing the system to identify unvalidated inputs in the computer program, to assign the nodes in the DFG to three categories, consisting of a first category corresponding to entry points of input data, a second category corresponding to input validation functions, and a third category corresponding to locations where the input data is consumed, to traverse the DFG, responsively to the query, in order to identify the nodes in the second category, and remove the identified nodes in the second category and the edges connected to the identified nodes from the DFG, and after removing the identified nodes and the edges, to traverse the DFG to detect a node in the first category having an edge extending to one of the nodes in the third category and to report the detected node as an unvalidated input vulnerability.
78. The system according to claim 77, wherein constructing the data flow graph comprises: constructing an object-oriented model of the source code; using the model, constructing a control flow graph of the computer program; and deriving the data flow graph from the control flow graph.
79. The system according to claim 77, wherein the entry points correspond to interactive inputs to the computer program, and the input validation functions sanitize the interactive inputs.
80. The system according to claim 77, wherein the locations where the input data is consumed correspond to database access commands, and wherein the unvalidated input vulnerability comprises an SQL injection vulnerability.
81. The system according to claim 77, wherein the locations where the input data is consumed correspond to Web screen outputs, and wherein the unvalidated input vulnerability comprises a cross-site scripting vulnerability.
82. The system according to claim 77, wherein the locations where the input data is consumed correspond to operating system direct access commands, and wherein the unvalidated input vulnerability comprises a command injection vulnerability.
83. The system according to claim 77, wherein the locations where the input data is consumed correspond to LDAP access commands, and wherein the unvalidated input vulnerability comprises an LDAP injection vulnerability.
84. The system according to claim 77, wherein the locations where the input data is consumed correspond to reflection commands, and wherein the unvalidated input vulnerability comprises a reflection injection vulnerability.
85. The system according to claim 77, wherein the locations where the input data is consumed correspond to file access commands, and wherein the unvalidated input vulnerability comprises a path manipulation vulnerability.
86. The system according to claim 77, wherein the processor is configured to modify the source code to remove the security vulnerability.
87. A computer software product, comprising a tangible, non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive source code of a computer program to be analyzed, to construct, by analyzing the source code, a data flow graph (DFG) of the computer program, to receive a query, directing the system to identify unvalidated inputs in the computer program, to assign the nodes in the DFG to three categories, consisting of a first category corresponding to entry points of input data, a second category corresponding to input validation functions, and a third category corresponding to locations where the input data is consumed, to traverse the DFG, responsively to the query, in order to identify the nodes in the second category, and remove the identified nodes in the second category and the edges connected to the identified nodes from the DFG, and after removing the identified nodes and the edges, to traverse the DFG to detect a node in the first category having an edge extending to one of the nodes in the third category and to report the detected node as an unvalidated input vulnerability.
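For illustration only, and without limiting the claims, the graph procedure recited in claims 67, 77 and 87 may be sketched as follows. The sketch assumes the DFG is given as a list of directed edges between node labels and that the three node categories are supplied as sets. It also assumes that "an edge extending to" a consuming node may be read as reachability over one or more remaining edges, of which a single direct edge is the simplest case. All node labels are hypothetical.

```python
from collections import deque


def find_unvalidated_inputs(edges, sources, sanitizers, sinks):
    """Remove sanitizer nodes and their edges, then report every source
    node from which a sink node is still reachable."""
    graph = {}
    for src, dst in edges:
        # Dropping each edge that touches a sanitizer removes that node.
        if src in sanitizers or dst in sanitizers:
            continue
        graph.setdefault(src, []).append(dst)

    findings = []
    for source in sorted(sources):
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node in sinks:
                findings.append(source)
                break
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return findings


edges = [
    ("Request.name", "Sanitize"),
    ("Sanitize", "ExecQuery"),
    ("Request.id", "ExecQuery"),
]
unvalidated = find_unvalidated_inputs(
    edges,
    sources={"Request.name", "Request.id"},
    sanitizers={"Sanitize"},
    sinks={"ExecQuery"},
)
```

In this example only `Request.id` is reported, because the sole route from `Request.name` to the consuming node passes through the removed validation node.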